CN110782490A - Video depth map estimation method and device with space-time consistency
- Publication number: CN110782490A
- Application number: CN201910907522.2A
- Authority
- CN
- China
- Prior art keywords: depth map, frame, estimation, training, video
- Prior art date: 2019-09-24
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

- G06T7/55—Depth or shape recovery from multiple images (G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; G06T7/00—Image analysis; G06T7/50—Depth or shape recovery)
- G06T2207/10016—Video; Image sequence (G06T2207/00—Indexing scheme for image analysis or image enhancement; G06T2207/10—Image acquisition modality)
- G06T2207/20081—Training; Learning (G06T2207/00—Indexing scheme for image analysis or image enhancement; G06T2207/20—Special algorithmic details)
- G06T2207/20084—Artificial neural networks [ANN] (G06T2207/00—Indexing scheme for image analysis or image enhancement; G06T2207/20—Special algorithmic details)
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a video depth map estimation method and device with space-time consistency. The method comprises: generating a training set containing a plurality of sequences in which the central frame serves as the target view and the preceding and following frames serve as source views; for static objects in the scene, constructing a framework that jointly trains monocular depth and camera pose estimation from unlabeled video sequences, the framework comprising a depth map estimation network, a camera pose estimation network and the loss function of this part; for moving objects in the scene, cascading an optical flow network after the created framework to model motion in the scene, including building the optical flow estimation network and the loss function of this part; for the space-time consistency check of the depth map, proposing a loss function for the deep neural network; optimizing the model by jointly training monocular depth and camera pose estimation and then training the optical flow network; and using the optimized model to estimate depth maps for consecutive video frames.
Description
Technical Field
The invention belongs to the field of geometric understanding of video scenes and relates to techniques for estimating depth maps of video frames, in particular to a technical scheme for estimating depth maps of consecutive video frames with space-time consistency.
Background
Understanding the 3D geometry of a scene in video is a fundamental problem in visual perception and encompasses many basic computer vision tasks such as depth estimation, camera pose estimation and optical flow estimation. A depth map is an image that encodes the distance from object surfaces in the scene to the viewpoint. Estimating depth is an important component of understanding the geometric relationships within a scene, so a general method for extracting depth maps from images is highly desirable. Distance relationships help provide richer representations of objects and environments and further enable applications such as 3D modeling, object recognition and robotics. In a computer vision system, distance information supports practical applications such as image segmentation, object detection, object tracking and three-dimensional reconstruction.
Existing depth map estimation methods fall mainly into four categories: manual scanning with physical devices, traditional mathematical methods, supervised deep learning and unsupervised deep learning. Each has drawbacks. Device-based scanning relies on physical equipment for manual acquisition, but existing depth scanners (such as the Kinect) are not only expensive to manufacture but also unsuitable for general application scenarios. Traditional mathematical methods yield depth estimates of too low accuracy and generally cannot handle complex scenes effectively. Supervised deep learning methods obtain results from deep learning, a network architecture and a mathematical model, but they depend strongly on labeled data sets, whose acquisition consumes large amounts of manpower and material resources, and they generally generalize poorly. Unsupervised deep learning methods, like existing video depth estimation methods, usually ignore the spatial and temporal discontinuity of the depth map, so large errors often arise in occluded regions or non-Lambertian surface regions.
Disclosure of Invention
To overcome the defects of the existing methods, the invention provides a technical scheme for depth estimation of consecutive video frames with space-time consistency, so that the estimated depth map recovers clearer details in certain areas while the temporal continuity between different video frames is enhanced, making the final result more accurate.

The technical scheme of the invention provides a video depth map estimation method with space-time consistency, comprising the following steps:

step 1, generating a training set, wherein the length of each image sequence is fixed at 3 frames, the central frame is taken as the target view and the preceding and following frames are taken as source views, producing a plurality of sequences;
step 2, for static objects in the scene, constructing a framework that jointly trains monocular depth and camera pose estimation from unlabeled video sequences, the framework comprising a depth map estimation network, a camera pose estimation network and the loss function of this part;

step 3, for moving objects in the scene, cascading an optical flow network after the framework created in step 2 to model motion in the scene, including building the optical flow estimation network and the loss function of this part;

step 4, for the space-time consistency check of the depth map, proposing a loss function for the deep neural network;

step 5, optimizing the model, including jointly training monocular depth and camera pose estimation and then training the remaining optical flow network on that basis; the optimized model is then used to estimate depth maps for consecutive video frames.
Moreover, in step 2, a depth map estimation network and an optical flow estimation network each composed of an encoder and a decoder are adopted, and cross-layer connections are used for multi-scale depth prediction.

Moreover, in step 2, unlabeled video is used for unsupervised training: the geometric characteristics of the moving three-dimensional scene are trained jointly, combined into an image synthesis loss, and image similarity is used as supervision to perform unsupervised learning separately on the static and dynamic scenes in the images.

Moreover, in step 4, a spatial consistency loss is proposed that constrains the difference between the flow from the frame-t image to the frame-(t+1) image and the flow from the frame-(t+1) image back to the frame-t image; and a temporal consistency loss is proposed that constrains the difference between the composition of the flows from frame t-1 to frame t and from frame t to frame t+1, and the flow directly from frame t-1 to frame t+1.
The invention also provides a corresponding device for realizing the video depth map estimation method with space-time consistency.
The invention has the following advantages: 1. it yields a video depth estimation scheme with better generalization; 2. it introduces a space-time consistency check and a new loss function, increasing the correlation between the depth maps of different video frames and solving the problem of excessive frame-to-frame error in the depth map results of consecutive video frames; 3. it improves the depth map estimates for low-texture, three-dimensionally blurred, occluded and similar regions of the scene, thereby improving the accuracy of the overall depth map estimation result.
Drawings
Fig. 1 is an overall flowchart framework diagram of a video depth map estimation method with spatio-temporal consistency according to an embodiment of the present invention.
FIG. 2 is an overall framework diagram for jointly training monocular depth and camera pose estimates from unlabeled video sequences in accordance with an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and examples.
The invention provides a video depth map estimation method that trains depth estimation, optical flow estimation and camera pose estimation together through the geometric characteristics of a moving three-dimensional scene, combines them into an image synthesis loss, and uses image similarity as supervision to train the static and dynamic parts of the image in an unsupervised manner. It also provides a new loss function to address the depth discontinuities in space and time that frequently occur in video depth map estimation. Referring to fig. 1, the method for estimating a video depth map with spatio-temporal consistency according to an embodiment of the invention includes the following steps:
Step 1: generate the training set.

The KITTI data set, commonly used in the field of video depth estimation, is adopted; it is a computer vision image data set for autonomous driving scenarios, covering urban, rural, road and other scenes, whose images contain up to more than ten vehicles and thirty pedestrians as well as various conditions such as occlusion and motion, and therefore carry rich image information. The specific processing fixes the length of each image sequence at 3 frames, taking the central frame as the target view and the ±1 frames (i.e. the frames immediately before and after it) as source views. Using images from the KITTI data set, a total of 12000 sequences were obtained, of which 10800 were used for training and 1200 for validation.
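By way of illustration, a minimal sketch of this slicing step is given below; it is an assumption for illustration only, not the patent's code, and the file layout and path names are hypothetical:

```python
# Hypothetical sketch: slice a folder of consecutive KITTI frames into 3-frame
# training samples, with the center frame as target view and the +/-1 frames
# as source views, as described above.
from pathlib import Path

def make_sequences(frame_dir, seq_len=3):
    """Return (source_frames, target_frame) tuples from consecutive video frames."""
    frames = sorted(Path(frame_dir).glob("*.png"))
    half = seq_len // 2
    samples = []
    for i in range(half, len(frames) - half):
        target = frames[i]                                          # central frame
        sources = [frames[i + d] for d in range(-half, half + 1) if d != 0]
        samples.append((sources, target))
    return samples

samples = make_sequences("kitti/sequence_00")   # hypothetical path
# samples could then be split ~90/10 for training/validation (10800/1200 above)
```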
Step 2: construct a framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences.
A framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences is constructed for static objects in a scene. The key supervisory signals of the depth and camera pose prediction convolutional neural network in the step come from the task of view synthesis: given an input view of a scene, new images of the scene are synthesized as seen from different camera poses.
The invention preferably adopts a depth map estimation network and an optical flow estimation network each composed of an encoder and a decoder, and uses cross-layer connections for multi-scale depth prediction, improving both operational efficiency and the accuracy of the result.

The invention proposes unsupervised training on unlabeled video: the geometric characteristics of the moving three-dimensional scene are trained together, combined into an image synthesis loss, and image similarity is used as supervision to perform unsupervised learning separately on the static and dynamic scenes in the images. This saves a large amount of manpower and material resources and gives the invention greater universality.
Referring to fig. 2, the implementation of step 2 of the example is illustrated as follows:
(1) Build the depth map estimation network structure.
Since the depth map estimation network must learn geometric relationships at the pixel level, it consists of two parts, an encoder and a decoder; their specific structures are shown in Table 1 and Table 2. The encoder uses convolutional layers as an efficient means of learning. The decoder consists of deconvolution layers that upsample the spatial features back to the full input scale. To retain global high-level features and local detail simultaneously, cross-layer connections are used between the encoder and decoder, and multi-scale depth prediction is performed.
TABLE 1 Encoder network structure

TABLE 2 Decoder network structure

In Tables 1 and 2, Conv1, Conv1b, Conv2, Conv2b, ..., Conv7, Conv7b are convolutional layers; Disp1, Disp2, ..., Disp4 are the cross-layer connected disparity outputs; Iconv1, ..., Iconv7 and Upconv1, ..., Upconv7 are deconvolution layers; k is the kernel size, s is the stride, and chns is the number of input and output channels of each layer; in and out give the scaling factor of each layer's input and output relative to the size of the original input image; input lists the input of each layer, where + denotes concatenation and ↑ denotes 2x upsampling of that layer.
The network structure is divided into 6 scales; the largest scale equals the original image, and each subsequent scale is half the size of the previous one, so the feature map at the smallest scale has only one sixty-fourth of the original resolution but as many as 512 channels. Downsampling is performed with max pooling in the encoder, and upsampling with deconvolution layers in the decoder. At each scale, the encoder output is passed across layers to the decoder of the corresponding scale; after the output feature map is connected with that decoder's features, the combined feature map is fed as input into the corresponding deconvolution layer.
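As a rough illustration of this encoder-decoder pattern, the following PyTorch sketch uses strided convolutions, transposed-convolution upsampling, additive cross-layer connections and one disparity head per decoder scale. The channel counts, the number of scales and the additive (rather than concatenated) skip fusion are simplifying assumptions and do not reproduce Tables 1 and 2, which, for instance, use max pooling for downsampling:

```python
# A minimal sketch (assumptions noted above) of an encoder-decoder depth network
# with cross-layer connections and multi-scale disparity prediction.
import torch
import torch.nn as nn

def conv(cin, cout):    # encoder block: stride-2 convolution halves the resolution
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(True))

def upconv(cin, cout):  # decoder block: transposed convolution doubles the resolution
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), nn.ReLU(True))

class DepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        chs = [32, 64, 128, 256, 512]
        self.enc = nn.ModuleList(conv(i, o) for i, o in zip([3] + chs[:-1], chs))
        dec_out = chs[::-1][1:] + [16]                       # 256, 128, 64, 32, 16
        self.dec = nn.ModuleList(upconv(i, o) for i, o in zip(chs[::-1], dec_out))
        # one disparity head per decoder scale: multi-scale depth prediction
        self.heads = nn.ModuleList(nn.Conv2d(o, 1, 3, padding=1) for o in dec_out)

    def forward(self, x):
        skips = []
        for e in self.enc:
            x = e(x)
            skips.append(x)
        skips = skips[:-1][::-1]                             # 256, 128, 64, 32 channel maps
        disps = []
        for i, (d, head) in enumerate(zip(self.dec, self.heads)):
            x = d(x)
            if i < len(skips):
                x = x + skips[i]                             # cross-layer connection
            disps.append(torch.sigmoid(head(x)))             # disparity in (0, 1)
        return disps[::-1]                                   # finest scale first

disps = DepthNet()(torch.rand(1, 3, 128, 416))               # 5 scales of disparity maps
```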
(2) Build the camera pose estimation network structure.

The camera pose estimation network regresses the camera pose (the Euler angles of the camera rotation and the translation vector). Its main structure is similar to the encoder of the network in (1): 8 convolutional layers are followed by a global average pooling layer Pool and finally a prediction layer Softmax; the specific network structure is shown in Table 3. Except for the final prediction layer, all layers use Batch Normalization and ReLU activation functions.
TABLE 3
Here the 8 convolutional layers are denoted Conv1, Conv1b, Conv2, Conv2b, Conv3, Conv3b, Conv4 and Conv4b, and Fc1 and Fc2 are fully connected layers.
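A minimal sketch of such a pose network follows, assuming the target and two source views are stacked along the channel dimension; the channel counts and the 1x1 convolutional regression head are illustrative assumptions (the patent's Table 3 is not reproduced here):

```python
# A minimal sketch (assumptions noted above) of a pose network: a conv encoder,
# global average pooling, and a head regressing a 6-DoF pose per source view.
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    def __init__(self, num_sources=2):
        super().__init__()
        chans = [16, 32, 64, 128, 256, 256, 256, 256]        # 8 convolutional layers
        ins = [3 * (1 + num_sources)] + chans[:-1]           # target + source views stacked
        self.encoder = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(i, o, 3, stride=2, padding=1),
                          nn.BatchNorm2d(o), nn.ReLU(True))  # BN + ReLU on every conv layer
            for i, o in zip(ins, chans)])
        self.pred = nn.Conv2d(chans[-1], 6 * num_sources, 1) # prediction layer

    def forward(self, imgs):                                 # imgs: (B, 3*(1+num_sources), H, W)
        pose = self.pred(self.encoder(imgs)).mean(dim=(2, 3))   # global average pooling
        return pose.view(imgs.size(0), -1, 6)  # per source: 3 Euler angles + 3 translations

poses = PoseNet()(torch.rand(1, 9, 128, 416))                # -> shape (1, 2, 6)
```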
(3) Construct the loss function of this part.
The depth network takes only the target view $I_t$ as input and outputs the per-pixel depth map $D_t$. The camera pose network takes the target view $I_t$ and the adjacent source views (e.g. $I_{t+1}$) as input and outputs the relative camera poses $\hat{T}_{t\to s}$. The outputs of the two networks are then used to inverse-warp the source views to reconstruct the target view, and the photometric error is used to train the convolutional neural networks. Because view synthesis serves as the supervision, this framework can be trained on unlabeled video in an unsupervised manner.

The invention denotes the training image sequence by $\langle I_1, \dots, I_n \rangle$, where $n$ is the number of frames; there are $n$ pictures $I_1, \dots, I_n$ in the whole data set, but each computation operates on three consecutive frames. A specific implementation could also use more than three frames at a time, at the cost of a larger computation per step.

One frame $I_t$ is selected as the target view and the remaining frames serve as source views $I_s$ ($1 \le s \le n$, $s \ne t$). The supervisory signal can be expressed as

$$L_{rs} = \sum_s \sum_p \left| I_t(p) - \tilde{I}_s^{rig}(p) \right|,$$

where $p$ indexes the pixel coordinates and $\tilde{I}_s^{rig}$ denotes the view synthesized from the source frame $I_s$ toward the predicted target frame through the rigid flow; the superscript $rig$ marks that this part considers only static rigid objects. The supervisory signal at this stage therefore comes from minimizing the difference between the synthesized view $\tilde{I}_s^{rig}$ and the original frame $I_t$: $I_t(p)$ is the value of point $p$ in picture $I_t$, $\tilde{I}_s^{rig}(p)$ is the value at the position of $p$ computed through the rigid flow, and $L_{rs}$ is their difference, which the invention requires to be as small as possible during training.
A key component of this framework is a differentiable depth-image-based renderer that reconstructs the target view by sampling pixels from the source view, based on the predicted depth map $\hat{D}_t$ and the relative pose $\hat{T}_{t\to s}$. Let $p_t$ denote the homogeneous coordinates of a pixel in the target view and $K$ the camera intrinsic matrix; the projected position of $p_t$ on the source view, $p_s$, can then be obtained by the formula

$$p_s \sim K \, \hat{T}_{t\to s} \, \hat{D}_t(p_t) \, K^{-1} p_t.$$
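The projection can be illustrated with a small NumPy sketch; this is an illustration of the formula above, not the patent's code:

```python
# Project a target-view pixel into the source view: p_s ~ K T D(p_t) K^-1 p_t.
import numpy as np

def project(p_t, depth, K, T):
    """p_t: (u, v) pixel in the target view; depth: predicted depth at p_t;
    K: 3x3 intrinsics; T: 4x4 relative pose target->source. Returns (u, v) in the source view."""
    cam = depth * (np.linalg.inv(K) @ np.array([p_t[0], p_t[1], 1.0]))  # back-project to 3D
    src = (T @ np.append(cam, 1.0))[:3]          # rigid transform into the source camera frame
    pix = K @ src                                # project onto the source image plane
    return pix[:2] / pix[2]                      # perspective divide

K = np.array([[718.9, 0, 607.2], [0, 718.9, 185.2], [0, 0, 1.0]])  # KITTI-like intrinsics
print(project((300.0, 120.0), depth=15.0, K=K, T=np.eye(4)))       # identity pose: same pixel
```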
It should be noted that the projected coordinates $p_s$ are continuous values. To obtain the value $I_s(p_s)$ used to fill the reconstructed view $\tilde{I}_s(p_t)$, where $\hat{D}_t(p_t)$ is the predicted depth of point $p$ in the frame-$t$ picture and $I_s(p_s)$ is the value at the projected position of $p$ in frame $s$, a differentiable bilinear sampling mechanism is used, which linearly interpolates the values of the 4 pixel neighbours of $p_s$ (top-left, top-right, bottom-left and bottom-right) to approximate $I_s(p_s)$, i.e.

$$\tilde{I}_s(p_t) = I_s(p_s) = \sum_{i \in \{t,b\}} \sum_{j \in \{l,r\}} w^{ij} I_s(p_s^{ij}),$$

where $w^{ij}$ is linearly proportional to the spatial proximity between $p_s$ and $p_s^{ij}$ and $\sum_{i,j} w^{ij} = 1$; $t$, $b$, $l$ and $r$ denote top, bottom, left and right respectively, and $w^{ij}$ is the proportional coefficient of each point. The pixel-warping coordinates obtained here can be decomposed into depth and camera pose through projective geometry.
The differentiable bilinear sampling mechanism is prior art, namely bilinear differentiable interpolation, and its details are omitted here; a sketch follows for reference.
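The sketch below spells out the interpolation weights for a single point; in practice a framework routine such as PyTorch's grid_sample performs this same step over whole images:

```python
# Bilinear sampling: the value at continuous coordinates (x, y) is approximated
# from its 4 integer neighbours with weights proportional to spatial proximity.
import numpy as np

def bilinear_sample(img, x, y):
    """img: (H, W) array; (x, y): continuous source coordinates."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))   # top-left neighbour
    x1, y1 = x0 + 1, y0 + 1
    wl, wt = x1 - x, y1 - y                       # proximity-based weights
    return (wt * wl * img[y0, x0] + wt * (1 - wl) * img[y0, x1] +
            (1 - wt) * wl * img[y1, x0] + (1 - wt) * (1 - wl) * img[y1, x1])

img = np.arange(16.0).reshape(4, 4)
print(bilinear_sample(img, 1.5, 2.25))  # 10.5, interpolated from the 4 neighbours
```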
Step 3: after the framework created in step 2, cascade an optical flow network to model the motion in the scene.

Step 2 ignores moving objects in the scene; adding the optical flow estimation network effectively compensates and corrects the depth map for moving objects, improving the accuracy of the result.
The specific implementation process is described as follows:
(1) Build the optical flow estimation network structure. The remaining non-rigid flow, whose displacement is caused only by the relative motion of objects with respect to the world scene, is learned with the optical flow network. The framework of the optical flow estimation network is similar to the depth map estimation network of step 2 (the same structures as Tables 1 and 2 can be used, since both produce output at the same resolution as the input picture) and likewise consists of an encoder and a decoder. The optical flow network is connected in cascade after the first-stage networks. For a given pair of image frames, the optical flow network uses the output of the step-2 networks, the rigid flow $f^{rig}_{t\to s}$, to predict the corresponding residual flow $f^{res}_{t\to s}$, and the final overall flow is $f^{full}_{t\to s} = f^{rig}_{t\to s} + f^{res}_{t\to s}$. The input of the optical flow network consists of several images concatenated along the channel dimension: the source and target frame pair $I_s$ and $I_t$, the rigid flow $f^{rig}_{t\to s}$ output by the step-2 networks, the synthesized view $\tilde{I}^{rig}_s$, and its error with respect to the original image $I_s$.
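The cascade can be sketched as follows, assuming flow_net is any encoder-decoder returning a 2-channel map and that all inputs share one spatial size; the exact channel layout is an assumption based on the description above:

```python
# Cascade stage sketch: predict residual (non-rigid) flow and add the rigid flow.
import torch

def full_flow(flow_net, I_t, I_s, rigid_flow, warped_rigid):
    err = (I_s - warped_rigid).abs()                     # photometric error map
    inp = torch.cat([I_t, I_s, rigid_flow, warped_rigid, err], dim=1)
    res_flow = flow_net(inp)                             # residual flow from the flow network
    return rigid_flow + res_flow                         # final overall (full) flow

# dummy flow_net, for shape checking only
f = full_flow(lambda x: torch.zeros(x.size(0), 2, x.size(2), x.size(3)),
              torch.rand(1, 3, 8, 8), torch.rand(1, 3, 8, 8),
              torch.zeros(1, 2, 8, 8), torch.rand(1, 3, 8, 8))
print(f.shape)   # torch.Size([1, 2, 8, 8])
```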
(2) Construct the loss function of this part. The supervision of step 2 is extended to this stage with a slight modification (introducing the influence of the optical flow component on the scene flow). Step 2 mainly handles the static scene and ignores the processing of moving objects. To make the learning process robust to these factors, an optical flow network is incorporated to train the remaining (optical flow) portion beyond the rigid flow. Specifically, over the entire predicted flow $f^{full}_{t\to s}$, image warping is performed between any pair of target and source frames using $\tilde{I}^{full}_s$ instead of the former $\tilde{I}^{rig}_s$, giving the warp loss of the full flow $L_{fs}$. The concrete formula is

$$L_{fs} = \sum_p \left| I_t(p) - \tilde{I}^{full}_s(p) \right|,$$

where $\tilde{I}^{full}_s(p)$ is the value at the position of point $p$ computed through the full flow.
Step 4: propose a loss function for the deep neural network.

For the space-time consistency check of the depth map, a loss function for the deep neural network is proposed to prevent excessive frame-to-frame error in the results for consecutive video frames, while also improving the estimates for low-texture, three-dimensionally blurred, occluded and similar regions of the scene to a certain extent.
The specific implementation process is described as follows:
the spatial consistency loss provided by the invention is realized by restricting the difference of flow values from a t frame image to a t +1 frame image and from a t +1 frame image to a t frame image, and the temporal consistency loss is realized by adding the difference restriction of the flow values from the t frame to the t +1 frame image and the flow values from the t-1 frame to the t +1 frame directly, wherein the specific formula is shown as the formula:
wherein the content of the first and second substances,
for the position of p-point at t-frame calculated by the overall stream of p-points from s-frame to t-frame, I
s(p) is the position of the p point in the s frame image,
for the entire stream of t-1 frames to t frames,
for the entire stream of t frames to t +1 frames,
the overall stream is from t-1 frames to t +1 frames. L is
ftFor differences between the stream values from t-frame to s-frame and the stream values from s-frame to t-frame, L
fpIs the difference between the stream value from the t-1 frame to the t frame plus the stream value from the t frame to the t +1 frame and the stream value from the t-1 frame directly to the t +1 frame. Ideally these two values should be as small as possible so they are used as a loss function to train the network.
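The two losses can be sketched directly on dense flow tensors; note that the composition in $L_{fp}$ is approximated pixelwise here for brevity, ignoring the warping of the second flow field, which is an assumption:

```python
# Spatial and temporal consistency losses on flow tensors of shape (B, 2, H, W).
import torch

def spatial_consistency(flow_fw, flow_bw):
    """L_ft: forward flow t->t+1 and backward flow t+1->t should cancel out."""
    return (flow_fw + flow_bw).abs().mean()

def temporal_consistency(flow_01, flow_12, flow_02):
    """L_fp: flow t-1->t composed with flow t->t+1 should match the direct flow t-1->t+1."""
    return (flow_01 + flow_12 - flow_02).abs().mean()

f = torch.zeros(1, 2, 8, 8)
print(spatial_consistency(f, f), temporal_consistency(f, f, f))  # both zero here
```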
Pixels whose flows contradict severely (i.e. whose computed error is too large) are regarded as possible outliers. Since such regions violate the assumptions of image consistency and geometric consistency, they are handled here only through the smoothness loss. Accordingly, the full-flow warp loss $L_{fs}$ and the spatio-temporal consistency losses $L_{ft}$ and $L_{fp}$ are weighted per pixel.
Step 5: set the training parameters of the network and continuously optimize the model according to the error of each generation. During training, the defined loss functions must keep decreasing iteratively; the smaller they become, the more accurate the model. The optimized model can then be used to estimate depth maps for consecutive video frames.
In a specific implementation, monocular depth and camera pose estimation can be trained jointly, and the remaining optical flow network is then trained on that basis. This finally yields trained network models for depth map estimation, camera pose estimation and optical flow estimation.
The specific implementation process is described as follows:
the invention mainly comprises three sub-networks, namely a depth map estimation network and a camera pose estimation network, which form the reconstruction of a static object together, and the optical flow estimation network structure is combined with the output of the previous stage to realize the positioning of a moving object. Although the network can be trained together in an end-to-end fashion, there is no guarantee that local gradient optimization will bring the network to an optimal point. Therefore, a segmented training strategy is employed while reducing memory and computation consumption. Firstly, training a depth map estimation network and a camera pose estimation network, determining weights, and then training an optical flow estimation network. The resolution of the trained input images is resize to 128 x 416, while random upscaling, cropping, recoloring, etc. methods are also used to prevent overfitting. The network optimization function adopts a common neural network optimization method Adam. The initial learning rate was set to 0.0002 and the mini-batch size (minimum batch size) was set to 4. The first and second stages of the training process converge with 30 and 200 epochs (iterations), respectively. Testing on the KITTI data set it should be understood that parts not elaborated on in this specification are prior art.
The main characteristics of the above process are: a space-time consistency check for the depth map is proposed, the loss function of the deep neural network is improved, a consistency check aimed specifically at video depth maps is built into the deep learning model, and the overall loss function is improved to prevent excessive frame-to-frame error in the results for consecutive video frames. At the same time, the estimates for low-texture, three-dimensionally blurred, occluded and similar regions of the scene are improved to a certain extent.
In a specific implementation, the above process can be run automatically in software form. An apparatus for running the process should also fall within the protection scope of the present invention.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (5)
1. A video depth map estimation method with space-time consistency, characterized by comprising the following steps:

step 1, generating a training set, wherein the length of each image sequence is fixed at 3 frames, the central frame is taken as the target view and the preceding and following frames are taken as source views, producing a plurality of sequences;

step 2, for static objects in the scene, constructing a framework that jointly trains monocular depth and camera pose estimation from unlabeled video sequences, the framework comprising a depth map estimation network, a camera pose estimation network and the loss function of this part;

step 3, for moving objects in the scene, cascading an optical flow network after the framework created in step 2 to model motion in the scene, including building the optical flow estimation network and the loss function of this part;

step 4, for the space-time consistency check of the depth map, proposing a loss function for the deep neural network;

step 5, optimizing the model, including jointly training monocular depth and camera pose estimation and then training the remaining optical flow network on that basis; and using the optimized model to estimate depth maps for consecutive video frames.
2. The video depth map estimation method with space-time consistency according to claim 1, characterized in that: in step 2, a depth map estimation network and an optical flow estimation network each composed of an encoder and a decoder are adopted, and cross-layer connections are used for multi-scale depth prediction.
3. The video depth map estimation method with space-time consistency according to claim 1, characterized in that: in step 2, unlabeled video is used for unsupervised training, including training the geometric characteristics of the moving three-dimensional scene jointly, combining them into an image synthesis loss, and using image similarity as supervision to perform unsupervised learning separately on the static and dynamic scenes in the images.
4. The video depth map estimation method with space-time consistency according to claim 1, 2 or 3, characterized in that: in step 4, a spatial consistency loss is proposed that constrains the difference between the flow from the frame-t image to the frame-(t+1) image and the flow from the frame-(t+1) image back to the frame-t image; and a temporal consistency loss is proposed that constrains the difference between the composition of the flows from frame t-1 to frame t and from frame t to frame t+1, and the flow directly from frame t-1 to frame t+1.
5. A video depth map estimation apparatus with space-time consistency, characterized in that it is configured to implement the video depth map estimation method with space-time consistency of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910907522.2A CN110782490B (en) | 2019-09-24 | 2019-09-24 | Video depth map estimation method and device with space-time consistency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910907522.2A CN110782490B (en) | 2019-09-24 | 2019-09-24 | Video depth map estimation method and device with space-time consistency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110782490A (en) | 2020-02-11
CN110782490B CN110782490B (en) | 2022-07-05 |
Family
ID=69383733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910907522.2A Active CN110782490B (en) | 2019-09-24 | 2019-09-24 | Video depth map estimation method and device with space-time consistency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110782490B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120127267A1 (en) * | 2010-11-23 | 2012-05-24 | Qualcomm Incorporated | Depth estimation based on global motion |
CN103002309A (en) * | 2012-09-25 | 2013-03-27 | 浙江大学 | Depth recovery method for time-space consistency of dynamic scene videos shot by multi-view synchronous camera |
CN105100771A (en) * | 2015-07-14 | 2015-11-25 | 山东大学 | Single-viewpoint video depth obtaining method based on scene classification and geometric dimension |
CN106599805A (en) * | 2016-12-01 | 2017-04-26 | 华中科技大学 | Supervised data driving-based monocular video depth estimating method |
CN106612427A (en) * | 2016-12-29 | 2017-05-03 | 浙江工商大学 | Method for generating spatial-temporal consistency depth map sequence based on convolution neural network |
CN107481279A (en) * | 2017-05-18 | 2017-12-15 | 华中科技大学 | A kind of monocular video depth map computational methods |
CN107274445A (en) * | 2017-05-19 | 2017-10-20 | 华中科技大学 | A kind of image depth estimation method and system |
Non-Patent Citations (4)

- ANTONIO W. VIEIRA et al., "STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences", Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 30 September 2012, pages 1-8
- TAK-WAI HUI et al., "Dense depth map generation using sparse depth data from normal flow", 2014 IEEE International Conference on Image Processing (ICIP), 29 January 2015, pages 3837-3841
- JIANG Hanqing et al., "Spatio-temporally consistent depth recovery of dynamic scenes captured by multiple handheld cameras" (in Chinese), Journal of Computer-Aided Design & Computer Graphics, vol. 25, no. 2, 28 February 2013, pages 137-145
- GE Liyue et al., "3D scene flow estimation with depth-image-optimized layered segmentation" (in Chinese), Journal of Nanchang Hangkong University (Natural Science Edition), vol. 32, no. 2, 30 June 2018, pages 17-25
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111402310A (en) * | 2020-02-29 | 2020-07-10 | 同济大学 | Monocular image depth estimation method and system based on depth estimation network |
CN111402310B (en) * | 2020-02-29 | 2023-03-28 | 同济大学 | Monocular image depth estimation method and system based on depth estimation network |
CN111311664A (en) * | 2020-03-03 | 2020-06-19 | 上海交通大学 | Joint unsupervised estimation method and system for depth, pose and scene stream |
CN111311664B (en) * | 2020-03-03 | 2023-04-21 | 上海交通大学 | Combined unsupervised estimation method and system for depth, pose and scene flow |
CN111583305A (en) * | 2020-05-11 | 2020-08-25 | 北京市商汤科技开发有限公司 | Neural network training and motion trajectory determination method, device, equipment and medium |
CN111709982B (en) * | 2020-05-22 | 2022-08-26 | 浙江四点灵机器人股份有限公司 | Three-dimensional reconstruction method for dynamic environment |
CN111709982A (en) * | 2020-05-22 | 2020-09-25 | 浙江四点灵机器人股份有限公司 | Three-dimensional reconstruction method for dynamic environment |
CN112085717A (en) * | 2020-09-04 | 2020-12-15 | 厦门大学 | Video prediction method and system for laparoscopic surgery |
CN112085717B (en) * | 2020-09-04 | 2024-03-19 | 厦门大学 | Video prediction method and system for laparoscopic surgery |
CN112270691A (en) * | 2020-10-15 | 2021-01-26 | 电子科技大学 | Monocular video structure and motion prediction method based on dynamic filter network |
CN112270691B (en) * | 2020-10-15 | 2023-04-21 | 电子科技大学 | Monocular video structure and motion prediction method based on dynamic filter network |
CN112344922A (en) * | 2020-10-26 | 2021-02-09 | 中国科学院自动化研究所 | Monocular vision odometer positioning method and system |
WO2022206020A1 (en) * | 2021-03-31 | 2022-10-06 | 中国科学院深圳先进技术研究院 | Method and apparatus for estimating depth of field of image, and terminal device and storage medium |
CN113222895A (en) * | 2021-04-10 | 2021-08-06 | 河南巨捷电子科技有限公司 | Electrode defect detection method and system based on artificial intelligence |
CN112801074B (en) * | 2021-04-15 | 2021-07-16 | 速度时空信息科技股份有限公司 | Depth map estimation method based on traffic camera |
CN112801074A (en) * | 2021-04-15 | 2021-05-14 | 速度时空信息科技股份有限公司 | Depth map estimation method based on traffic camera |
CN113284173B (en) * | 2021-04-20 | 2023-12-19 | 中国矿业大学 | End-to-end scene flow and pose joint learning method based on false laser radar |
CN113284173A (en) * | 2021-04-20 | 2021-08-20 | 中国矿业大学 | End-to-end scene flow and pose joint learning method based on pseudo laser radar |
CN114359363A (en) * | 2022-01-11 | 2022-04-15 | 浙江大学 | Video consistency depth estimation method and device based on deep learning |
CN114663347A (en) * | 2022-02-07 | 2022-06-24 | 中国科学院自动化研究所 | Unsupervised object instance detection method and unsupervised object instance detection device |
CN115131404A (en) * | 2022-07-01 | 2022-09-30 | 上海人工智能创新中心 | Monocular 3D detection method based on motion estimation depth |
CN115131404B (en) * | 2022-07-01 | 2024-06-14 | 上海人工智能创新中心 | Monocular 3D detection method based on motion estimation depth |
CN114937125A (en) * | 2022-07-25 | 2022-08-23 | 深圳大学 | Reconstructable metric information prediction method, reconstructable metric information prediction device, computer equipment and storage medium |
CN115187638A (en) * | 2022-09-07 | 2022-10-14 | 南京逸智网络空间技术创新研究院有限公司 | Unsupervised monocular depth estimation method based on optical flow mask |
WO2024051184A1 (en) * | 2022-09-07 | 2024-03-14 | 南京逸智网络空间技术创新研究院有限公司 | Optical flow mask-based unsupervised monocular depth estimation method |
CN117115786A (en) * | 2023-10-23 | 2023-11-24 | 青岛哈尔滨工程大学创新发展中心 | Depth estimation model training method for joint segmentation tracking and application method |
CN117115786B (en) * | 2023-10-23 | 2024-01-26 | 青岛哈尔滨工程大学创新发展中心 | Depth estimation model training method for joint segmentation tracking and application method |
Also Published As
Publication number | Publication date |
---|---|
CN110782490B (en) | 2022-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110782490B (en) | Video depth map estimation method and device with space-time consistency | |
CN111739078B (en) | Monocular unsupervised depth estimation method based on context attention mechanism | |
CN111105432B (en) | Unsupervised end-to-end driving environment perception method based on deep learning | |
TWI739151B (en) | Method, device and electronic equipment for image generation network training and image processing | |
CN111354030B (en) | Method for generating unsupervised monocular image depth map embedded into SENet unit | |
CN113077505B (en) | Monocular depth estimation network optimization method based on contrast learning | |
CN110910437B (en) | Depth prediction method for complex indoor scene | |
CN109903315B (en) | Method, apparatus, device and readable storage medium for optical flow prediction | |
CN111325784A (en) | Unsupervised pose and depth calculation method and system | |
CN113850900B (en) | Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction | |
CN110942484B (en) | Camera self-motion estimation method based on occlusion perception and feature pyramid matching | |
CN112270692B (en) | Monocular video structure and motion prediction self-supervision method based on super-resolution | |
CN113554032B (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN115187638B (en) | Unsupervised monocular depth estimation method based on optical flow mask | |
CN114170286B (en) | Monocular depth estimation method based on unsupervised deep learning | |
CN115035171A (en) | Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion | |
CN115294282A (en) | Monocular depth estimation system and method for enhancing feature fusion in three-dimensional scene reconstruction | |
CN111899295A (en) | Monocular scene depth prediction method based on deep learning | |
CN116205962B (en) | Monocular depth estimation method and system based on complete context information | |
CN115546273A (en) | Scene structure depth estimation method for indoor fisheye image | |
CN117593702A (en) | Remote monitoring method, device, equipment and storage medium | |
CN113610912A (en) | System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction | |
CN115578260A (en) | Attention method and system for direction decoupling for image super-resolution | |
CN115131418A (en) | Monocular depth estimation algorithm based on Transformer | |
CN115272450A (en) | Target positioning method based on panoramic segmentation |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |