WO2021114870A1 - Disparity estimation system, method, electronic device, and computer-readable storage medium - Google Patents

Disparity estimation system, method, electronic device, and computer-readable storage medium

Info

Publication number: WO2021114870A1
Authority: WIPO (PCT)
Prior art keywords: disparity, image, processing, level, size
Application number: PCT/CN2020/121824
Other languages: English (en), French (fr)
Inventors: 方舒, 周骥, 冯歆鹏
Original Assignee: 上海肇观电子科技有限公司
Application filed by 上海肇观电子科技有限公司
Priority to US17/127,540, published as US11158077B2
Publication of WO2021114870A1

Classifications

    • G06T7/593: Depth or shape recovery from multiple images, from stereo images
    • G06T7/55: Depth or shape recovery from multiple images
    • G06T7/00: Image analysis
    • G06T3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T7/13: Edge detection
    • G06T7/337: Image registration using feature-based methods involving reference images or patches
    • G06T7/60: Analysis of geometric attributes
    • G06T2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
    • G06T2207/20081: Training; learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/20228: Disparity calculation for image-based rendering

Definitions

  • the present disclosure relates to the field of computer vision technology, and in particular to a disparity estimation system, method, electronic device, and computer-readable storage medium.
  • Computer vision technology may be used to obtain the disparity between each pair of matching pixels in two images of the same scene captured from different viewing angles, so as to obtain a disparity map, and to derive the depth information of the scene from the disparity map. The depth information can be used in various fields such as three-dimensional reconstruction, automatic driving, and obstacle detection.
  • the method of obtaining disparity by using computer vision technology may include a local area matching method, a global optimization method, a semi-global method, a method based on a convolutional neural network, and so on.
  • a disparity estimation system, including: a feature extraction network configured to perform feature extraction on each image in an image pair and output the extracted image features to a disparity generation network; and the disparity generation network, configured to perform cascaded multi-level disparity processing according to the extracted image features to obtain multiple disparity maps with successively increasing sizes, wherein the input of the first level of disparity processing in the multi-level disparity processing includes a plurality of image features having sizes corresponding to that level, and the input of each level other than the first includes one or more image features having a size corresponding to that level, together with the disparity map generated by the previous level.
  • a disparity estimation method, including: performing feature extraction on each image in an image pair; and performing cascaded multi-level disparity processing based on the extracted image features to obtain multiple disparity maps with successively increasing sizes, wherein the input of the first level of disparity processing in the multi-level disparity processing includes a plurality of image features having a size corresponding to that level, and the input of each level other than the first includes one or more image features having a size corresponding to that level, together with the disparity map generated by the previous level.
  • an electronic device, including: a processor; and a memory storing a program, the program including instructions that, when executed by the processor, cause the processor to execute the method described in the present disclosure.
  • a computer-readable storage medium storing a program, the program including instructions that, when executed by a processor of an electronic device, cause the electronic device to execute the method described in the present disclosure.
  • Fig. 1 is a structural block diagram showing a disparity estimation system according to an exemplary embodiment of the present disclosure
  • FIG. 2 is a schematic diagram showing the basic structure features of an image according to an exemplary embodiment of the present disclosure;
  • Fig. 3 is a schematic diagram showing semantic features of an image according to an exemplary embodiment of the present disclosure
  • Fig. 4 is a schematic diagram showing edge features of an image according to an exemplary embodiment of the present disclosure
  • Fig. 5 is a block diagram showing a possible overall structure of a disparity estimation system according to an exemplary embodiment of the present disclosure
  • Fig. 6 is a block diagram showing another possible overall structure of a disparity estimation system according to an exemplary embodiment of the present disclosure
  • FIGS. 7A and 7B are schematic diagrams respectively showing a reference image on which network training according to an exemplary embodiment of the present disclosure is based and the corresponding ground-truth disparity map;
  • FIG. 8 is a diagram showing the multiple disparity maps, with sizes increasing from right to left, obtained by performing cascaded multi-level disparity processing on the reference image shown in FIG. 7A using a trained disparity estimation system according to an exemplary embodiment of the present disclosure;
  • FIG. 9 is a flowchart showing a disparity estimation method according to an exemplary embodiment of the present disclosure.
  • FIG. 10 is a flowchart illustrating multi-level disparity processing in a disparity estimation method according to an exemplary embodiment of the present disclosure
  • FIG. 11 is a structural block diagram showing an exemplary computing device that can be applied to an exemplary embodiment of the present disclosure.
  • The use of the terms first, second, etc. to describe various elements is not intended to limit the positional, timing, or importance relationship of these elements; such terms are only used to distinguish one element from another.
  • In some examples, the first element and the second element may refer to the same instance of the element, while in some cases, based on the context, they may also refer to different instances.
  • In related technologies, computer vision technology may be used to obtain the disparity between each pair of matching pixels in two images of the same scene captured from different viewing angles, so as to obtain a disparity map from which the depth information of the scene can be derived for fields such as three-dimensional reconstruction, automatic driving, and obstacle detection. Methods of obtaining the disparity may include a local area matching method, a global optimization method, a semi-global method, a method based on a neural network such as a convolutional neural network, and so on.
  • The local area matching method mainly includes matching cost calculation, cost aggregation, disparity calculation, and disparity optimization. It is fast and consumes little energy, but its effectiveness depends on algorithm parameters (such as the size of the matching window), making it difficult to meet the needs of complex scenes.
  • The global optimization method achieves better matching accuracy. It makes assumptions about a smoothing term and turns the stereo matching problem of disparity calculation into an energy minimization problem; most global optimization methods skip the cost aggregation step, construct an energy function over all pixels by considering the matching cost and the smoothing term, and obtain the disparity by minimizing the energy function. However, the global optimization method requires more computation and consumes more energy.
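  • The source does not write out the energy function explicitly; a typical form used in global stereo matching is

$$E(D) = \sum_{p} C(p, d_p) + \sum_{(p,q) \in \mathcal{N}} V(d_p, d_q)$$

where D is the disparity assignment over all pixels, C(p, d_p) is the matching cost of assigning disparity d_p to pixel p, V is the smoothing term penalizing disparity differences between neighboring pixels (p, q), and the disparity map is obtained by minimizing E(D).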
  • The semi-global method can balance matching accuracy and running speed to a certain extent. Unlike global methods, which optimize over all pixels jointly, it decomposes the energy function at each pixel into paths along multiple directions, evaluates each path (which can be done by dynamic programming), and sums the values of all paths to obtain the energy of that pixel. Compared with the local area matching method, however, the semi-global method still requires more computation and energy.
  • Methods based on neural networks such as a CNN (convolutional neural network) can obtain a larger receptive field by constructing a disparity network, and thus achieve better disparity prediction in untextured areas of an image. However, their amount of computation is related to the network parameters and the image size: the more complex the network parameters and the larger the image, the greater the memory consumption and the lower the running speed.
  • In view of this, the present disclosure provides a new disparity estimation system that performs cascaded multi-level disparity processing based on the image features extracted from each image in an image pair, so as to obtain multiple disparity maps with successively increasing sizes. The input of the first level of disparity processing may include a plurality of image features having a size corresponding to that level; the input of each level other than the first may include one or more image features having a size corresponding to that level, together with the disparity map generated by the previous level. Because the input of each level includes image features whose size matches that level, multiple disparity maps of different sizes can be obtained in one pass and used by target devices with different performance or accuracy requirements, meeting the accuracy and speed requirements of different target devices and improving the flexibility and applicability of the disparity estimation system.
  • FIG. 1 is a structural block diagram showing a disparity estimation system according to an exemplary embodiment of the present disclosure.
  • As shown in Fig. 1, the disparity estimation system 100 may include a feature extraction network 200 configured to perform feature extraction on each image in an image pair and output the extracted image features to a disparity generation network 300, and the disparity generation network 300 configured to perform cascaded multi-level disparity processing according to the extracted image features to obtain multiple disparity maps with successively increasing sizes. The input of the first level of disparity processing in the multi-level disparity processing includes a plurality of image features having sizes corresponding to that level; the input of each level other than the first includes one or more image features having a size corresponding to that level, and the disparity map generated by the previous level. In other words, the input of each level of disparity processing includes image features having a size corresponding to that level of disparity processing.
  • In some embodiments, the image pair may be an image pair for the same scene collected by a multi-view camera.
  • the sizes of the images in the image pair are the same, but the corresponding viewing angles are different.
  • the image pair may also be an image pair obtained by other methods (for example, obtained from another third-party device) that meets the requirements.
  • each image in the image pair may be a grayscale image or a color image.
  • A multi-view camera refers to a camera configured with two, three, or even more lenses capable of static or dynamic image shooting; the multiple lenses cover different viewing angles of the scene, enhancing the camera's ability to detect objects in the scene. Taking a binocular camera configured with two lenses (such as a left and a right camera) as an example: for any scene, the binocular camera can capture, through its two lenses, two images of the scene that have the same size but different shooting perspectives (for example, a left-eye image and a right-eye image). The image pair formed by the two images can be used to determine the displacement (for example, the horizontal displacement) of objects in the scene between corresponding pixels of the two images, that is, the disparity, and thereby depth information such as the distance of the objects.
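  • The disparity-to-depth relation itself is not spelled out here; for a rectified binocular camera with focal length f and baseline B (the distance between the two cameras), the standard relation is

$$Z = \frac{f \cdot B}{d}$$

where d is the disparity of a pixel and Z is the depth of the corresponding scene point, so a larger disparity corresponds to a closer object.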
  • The disparity estimation system 100 and the multi-view camera may be independent of each other. In that case, the disparity estimation system 100 can perform feature extraction on each image in the image pair collected by the multi-view camera for the same scene through the included feature extraction network 200, and perform cascaded multi-level disparity processing on the extracted image features through the included disparity generation network 300 to obtain multiple disparity maps with successively increasing sizes. Alternatively, the multi-view camera may be a part of the disparity estimation system 100; that is, the disparity estimation system 100 may also include the multi-view camera.
  • In some embodiments, the image features of each image in the image pair extracted by the feature extraction network 200 of the disparity estimation system 100 may include one or more of the following: basic structure features, semantic features, edge features, texture features, color features, object shape features, or features based on the image itself.
  • Fig. 2 uses three images (a), (b), and (c) (for example, grayscale or color images) to illustrate the basic structure features of an image that may be extracted according to an exemplary embodiment of the present disclosure. As can be seen from Fig. 2, basic structure features may refer to features that reflect the various fine structures of an image.
  • Fig. 3 uses four images (a), (b), (c), and (d) (for example, grayscale or color images) to illustrate the semantic features of an image that may be extracted according to an exemplary embodiment of the present disclosure. As can be seen from Fig. 3, semantic features may refer to features that can distinguish different objects or different types of objects in an image. Based on semantic features, the accuracy of disparity determination in ambiguous areas of an image (for example, large flat areas) can be improved.
  • Fig. 4 shows a schematic diagram of edge features of an image that may be extracted according to an exemplary embodiment of the present disclosure with two images (a) and (b) (for example, grayscale images, or color images). It can be seen from FIG. 4 that the edge feature may refer to the feature that can reflect the boundary information of the object or region in the image.
  • the texture feature, color feature, and object shape feature may respectively refer to features that can be used to reflect the texture, color, and shape of the object included in the image.
  • the feature based on the image itself may refer to the image itself, or may be an image obtained by up-sampling or down-sampling the image itself with a certain coefficient or ratio.
  • the coefficient or ratio of the up-sampling or down-sampling may be 2, 3, or other values greater than 1, for example.
  • In some embodiments, each image feature other than the features based on the image itself can be obtained by having the corresponding feature extraction sub-network operate on the corresponding image, which improves the efficiency of image feature extraction and, in turn, the efficiency of disparity estimation.
  • image features can be extracted from at least three different dimensions: basic structure features, semantic features, and edge features.
  • As shown in Fig. 5 and Fig. 6, which show possible overall structural block diagrams of the disparity estimation system 100 according to exemplary embodiments of the present disclosure, the feature extraction network 200 may include a plurality of feature extraction sub-networks, each used to extract a different feature of the image. The multiple feature extraction sub-networks may at least include a basic structure feature sub-network 201 for extracting the basic structure features of the image, a semantic feature sub-network 202 for extracting the semantic features of the image, and an edge feature sub-network 203 for extracting the edge features of the image.
  • The basic structure feature sub-network 201 may adopt any network that can extract the basic structural features of an image, such as VGG (very deep convolutional networks for large-scale image recognition) or ResNet (residual network).
  • The semantic feature sub-network 202 can adopt any network that can extract the semantic features of an image, such as DeepLabV3+ (encoder-decoder with atrous separable convolution for semantic image segmentation).
  • the edge feature sub-network 203 can adopt any network that can be used to extract edge features of an image, such as a HED (holistically-nested edge detection) network.
  • The HED network may use VGG as its backbone network, and when the edge feature sub-network 203 uses the HED network, the basic structure feature sub-network 201 and the edge feature sub-network 203 may share the same VGG network, thereby simplifying the structure of the feature extraction network.
  • In some embodiments, the feature extraction network 200, or each feature extraction sub-network it includes, may be an extraction network pre-trained on a training sample set. In this way, the efficiency of image feature extraction can be improved, thereby improving the efficiency of disparity estimation.
  • Alternatively, the feature extraction network 200, or each feature extraction sub-network it includes, may be obtained by real-time training based on a training sample set, or by real-time or periodic optimization of the pre-trained extraction network based on an updated training sample set, so as to improve the accuracy of the extracted features.
  • the training process of the feature extraction network 200 or each feature extraction sub-network included in the feature extraction network 200 may adopt supervised training or unsupervised training, which can be flexibly selected according to actual needs.
  • Supervised training usually uses existing training samples (such as labeled data) to learn the mapping from input to output, and then applies this mapping relationship to unknown data to achieve the purpose of classification or regression.
  • Algorithms for supervised training may include, for example, logistic regression algorithms, SVM (support vector machine) algorithms, decision tree algorithms, and so on.
  • The difference between unsupervised and supervised training is that unsupervised training requires no labeled training samples: it directly models unlabeled data to discover the underlying patterns. Typical algorithms include clustering algorithms, the random forest algorithm, and so on.
  • As mentioned above, the input of the first level of disparity processing in the multi-level disparity processing may include a plurality of image features having a size corresponding to that level, and the input of each level other than the first may include one or more image features having a size corresponding to that level.
  • In some embodiments, the image features extracted by the feature extraction network 200 may include image features of N sizes, where N is a positive integer not less than 2.
  • the value of N may be 4 (as shown in FIG. 5 or FIG. 6), of course, it may also be set to 2, 3, 5 or others according to actual needs.
  • It should be noted that a larger N is not necessarily better; an appropriate value can be selected by balancing the accuracy requirements of the target devices against the speed of the disparity estimation system.
  • The size of each image may refer to the size of a single channel of the image, which can be represented by the height and width of the image, for example as H×W, where H is the height of the image and W is the width, both in pixels. Alternatively, the size of an image may be represented by one or more parameters that reflect the number of pixels, the amount of data, the amount of storage, or the definition of the image. For a grayscale image the number of channels is 1, while a color image, having the three color channels R, G, and B, has 3 channels, so the actual size of a color image can be expressed as H×W×3.
  • The size of each image in the image pair (that is, the size of the original image before any down-sampling and/or up-sampling) may be determined by parameters of the multi-view camera used to capture the image pair, such as the size of its sensor and its number of pixels.
  • the size corresponding to each level of disparity processing may refer to a size consistent with the size of the disparity map required for each level of disparity processing.
  • The size of an image feature may refer to the size of a single channel of the picture formed by the feature itself, or to the size of the image from which the feature of the required size is extracted. That image may be each image in the image pair itself, or an image obtained by up-sampling or down-sampling that image with a certain coefficient or ratio. For example, the full-size image features of an image may be obtained by feature extraction on the image itself, while the half-size (1/2-size) image features may be obtained by down-sampling the image by a factor of 2 and performing feature extraction on the resulting 1/2-size image.
  • As mentioned above, the input of each level of disparity processing other than the first may include one or more image features having a size corresponding to that level, and may also include the disparity map generated by the previous level. In this way, the disparity map generated by the first level of disparity processing can be optimized step by step, so that disparity maps of the corresponding sizes are obtained.
  • The smallest-size image features among the N sizes of image features extracted by the feature extraction network 200 may include, for example, at least one image feature of the first image in the image pair and at least one image feature of the second image; the image features of each non-minimum size may include at least one image feature of the first image and/or at least one image feature of the second image.
  • For example, as shown in Fig. 5, the smallest-size image features among the N sizes of image features extracted by the feature extraction network 200 may include the basic structure features, semantic features, and edge features of the first image in the image pair (for example, the left-eye image), as well as the basic structure features of the second image (for example, the right-eye image); each non-minimum-size image feature may include edge features of the first image or features based on the first image itself.
  • In addition, each image feature of each image extracted by the feature extraction network 200 may have one or more sizes, and the number of those sizes may be less than or equal to N.
  • For example, taking Fig. 5 or Fig. 6 as an example, N may be 4; the extracted edge features of the first image and the features based on the first image itself may each have two sizes, the extracted basic structure features and semantic features of the first image may each have one size, and the extracted basic structure features of the second image may have one size.
  • Of course, Fig. 5 or Fig. 6 is only an example. Each image feature of each image extracted by the feature extraction network 200 may have one or two sizes as shown, or more sizes. For example, still taking N as 4, the edge features of the first image extracted by the feature extraction network 200 may instead have three or four sizes; this is not limited.
  • After the feature extraction network 200 extracts the image features of each image in the image pair, the features can be stored (for example, cached) in a storage device or storage medium for subsequent reading and use.
  • The feature extraction network 200 may also perform epipolar rectification on the images in the image pair before extracting image features, so that the images have disparity in only one direction (for example, the horizontal or the vertical direction). The disparity search range of the images can thus be limited to one direction, improving the efficiency of subsequent feature extraction and disparity generation.
  • Alternatively, the epipolar rectification of the images in the image pair may be performed by the multi-view camera or by other third-party equipment. For example, the multi-view camera may rectify the images and send the rectified image pair to the disparity estimation system; or, after the multi-view camera collects the image pair, it may send the pair to a third-party device, which performs the rectification and sends the rectified image pair to the disparity estimation system.
  • The largest disparity map among the multiple disparity maps obtained by the disparity generation network 300 may have the same size as each image in the image pair (that is, the original image size). In this way, the cascaded multi-level disparity processing yields at least one relatively high-accuracy disparity map whose size matches the images in the image pair, together with disparity maps of other accuracies, improving the accuracy of disparity estimation. Alternatively, depending on actual requirements, the size of every disparity map in the multiple disparity maps may also be smaller than the size of each image in the image pair.
  • In addition, the height and width of the later disparity map in any two adjacent disparity maps may be, respectively, twice the height and width of the earlier one. For example, if the last of the multiple disparity maps has size H×W (which can be the same as the size of each image in the image pair) and there are four disparity maps, their sizes in increasing order would be H/8×W/8, H/4×W/4, H/2×W/2, and H×W; that is, the value 2 is used as the scaling step between the heights and widths of adjacent disparity maps (or as the coefficient or ratio of up-sampling or down-sampling between adjacent disparity maps). Alternatively, the height and width of the later disparity map may be 3 times, 4 times, or another positive integer multiple greater than 1 of those of the earlier one, and a suitable value can be selected according to the accuracy actually required.
  • the image features extracted by the feature extraction network 200 may include image features of N sizes, where N is a positive integer not less than 2.
  • In some embodiments, the disparity generation network may be configured to: in the first level of the multi-level disparity processing, generate an initial disparity map of the smallest size according to at least a part of the smallest-size image features among the N sizes of image features; and in each subsequent level, perform disparity optimization processing on the disparity map generated by the previous level according to at least a part of the image features of the corresponding size, so as to generate an optimized disparity map of that size. The multiple disparity maps may include at least each optimized disparity map.
  • In other words, in these embodiments the multi-level disparity processing may include N+1 levels of disparity processing. Accordingly, the disparity generation network may be configured to, in the N levels of disparity processing other than the first, in order of size from small to large, perform disparity optimization processing on the disparity map generated by the respective previous level based on at least a part of the image features of the corresponding size among the N sizes of image features, so as to obtain N optimized disparity maps with successively increasing sizes, and use these N optimized disparity maps as the multiple disparity maps, where the sizes of the N optimized disparity maps correspond to the N sizes, respectively.
  • For example, as shown in Fig. 5, the multi-level disparity processing may include 4+1 levels of disparity processing, and the extracted image features may include image features of 4 sizes: 1/8 size, 1/4 size, 1/2 size, and full size (H×W, where full size refers to the size of the original images in the image pair). In this case, the disparity generation network may be configured to, in the first level of disparity processing, generate an initial disparity map of the smallest size (i.e., 1/8 size) according to at least a part of the smallest-size (1/8-size) image features among the 4 sizes of image features (for example, part or all of the extracted 1/8-size basic structure features, 1/8-size semantic features, and 1/8-size edge features of the first image, and the 1/8-size basic structure features of the second image). Then, in the 4 levels of disparity processing other than the first, in order of size from small to large, it may perform disparity optimization processing on the disparity map generated by the respective previous level, based in turn on part or all of the extracted 1/8-size edge features of the first image, part or all of the extracted 1/4-size edge features of the first image, the extracted 1/2-size features based on the first image itself, and the extracted full-size features based on the first image itself, so as to obtain 4 optimized disparity maps with successively increasing sizes (a 1/8-size, a 1/4-size, a 1/2-size, and a full-size optimized disparity map), and use these 4 optimized disparity maps as the multiple disparity maps.
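  • The patent describes this cascade only at the block-diagram level; the following PyTorch-style sketch shows one way the data flow could be wired, with `initial_net` and `refine_nets` standing in for the (unspecified) first-level and optimization sub-networks. All module names and internals here are illustrative assumptions; only the flow (smallest-size features to an initial map, then per-level upsampling, residual calculation, and addition) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedDisparityGeneration(nn.Module):
    """Sketch of the 4+1-level disparity generation of Fig. 5 (assumed layout).

    initial_net: maps the concatenated smallest-size (1/8) features to an
        initial disparity map (first level of disparity processing).
    refine_nets: one module per subsequent level; each predicts a residual
        map from guide features plus the (upsampled) previous disparity map.
    """

    def __init__(self, initial_net: nn.Module, refine_nets: nn.ModuleList):
        super().__init__()
        self.initial_net = initial_net
        self.refine_nets = refine_nets

    def forward(self, smallest_feats: torch.Tensor, guide_feats: list):
        # First level: 1/8-size initial disparity map (dispS1 in Fig. 5).
        disp = self.initial_net(smallest_feats)
        outputs = []
        for refine, guide in zip(self.refine_nets, guide_feats):
            # Upsample the previous map when it is smaller than this level.
            # (The source does not discuss rescaling disparity values after
            # upsampling; practical systems often multiply by the scale factor.)
            if disp.shape[-2:] != guide.shape[-2:]:
                disp = F.interpolate(disp, size=guide.shape[-2:],
                                     mode='bilinear', align_corners=False)
            # Residual calculation followed by combination (here: addition).
            residual = refine(torch.cat([guide, disp], dim=1))
            disp = disp + residual
            outputs.append(disp)  # optimized disparity map at this level's size
        return outputs            # e.g. [1/8, 1/4, 1/2, full] sizes for Fig. 5
```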
  • In other words, in these embodiments the multiple disparity maps obtained by the disparity estimation system 100 may exclude the initial disparity map generated by the first level of disparity processing and include only the optimized disparity maps obtained by successively optimizing it, which can improve the accuracy of the multiple disparity maps obtained by the system.
  • In other embodiments, the multi-level disparity processing may include N levels of disparity processing. Accordingly, the disparity generation network may be configured to, in the N-1 levels of disparity processing other than the first, in order of size from small to large, perform disparity optimization processing on the disparity map generated by the respective previous level based on at least a part of the image features of the corresponding size among the N-1 non-minimum sizes, so as to obtain N-1 optimized disparity maps with successively increasing sizes, and use the initial disparity map together with the N-1 optimized disparity maps as the multiple disparity maps, where the sizes of the initial disparity map and the N-1 optimized disparity maps correspond to the N sizes, respectively.
  • For example, as shown in Fig. 6, the multi-level disparity processing may include 4 levels of disparity processing, and the extracted image features may include image features of 4 sizes: 1/8 size, 1/4 size, 1/2 size, and full size (H×W, where full size refers to the size of each image in the image pair). As in the previous example, the disparity generation network may be configured to, in the first level of disparity processing, generate an initial disparity map of the smallest size (i.e., 1/8 size) according to at least a part of the smallest-size image features (for example, part or all of the extracted 1/8-size basic structure features, 1/8-size semantic features, and 1/8-size edge features of the first image, and the 1/8-size basic structure features of the second image); the subsequent 3 levels of disparity processing then optimize the disparity map generated by the respective previous level at 1/4, 1/2, and full size, so that the initial disparity map and the 3 optimized disparity maps together form the multiple disparity maps.
  • In other words, in these embodiments the multiple disparity maps obtained by the disparity estimation system 100 may also include the initial disparity map generated by the first level of disparity processing, which can improve the processing efficiency of the disparity estimation system.
  • In some embodiments, the disparity generation network 300 may be configured to, in each level of disparity processing other than the first, perform residual calculation on the disparity map generated by the previous level based on at least a part of the image features of the corresponding size to obtain a residual map of that size, and then combine the residual map with the disparity map generated by the previous level to obtain an optimized disparity map of that size.
  • For example, in the second level of disparity processing, a 1/8-size first residual map can be calculated based on part or all of the extracted 1/8-size edge features of the first image and the 1/8-size initial disparity map generated by the previous level, and the first residual map can be combined with (for example, added to) the initial disparity map to obtain a 1/8-size first optimized disparity map as the output of this level. In the next level, a 1/4-size second residual map can be calculated based on part or all of the extracted 1/4-size edge features of the first image and the first optimized disparity map (1/8 size), and the second residual map can be combined with the first optimized disparity map (for example, added to the 1/4-size up-sampled version of the first optimized disparity map) to obtain a 1/4-size second optimized disparity map as the output of this level, and so on. More specific examples will be discussed later.
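  • As a concrete (assumed) instance of one such refinement level, a small 2D convolutional sub-network can predict the residual map from the guide features concatenated with the disparity map; the layer widths below are illustrative only and are not taken from the patent. A module of this shape can be dropped into the `refine_nets` list of the cascade sketch above.

```python
import torch
import torch.nn as nn

class ResidualRefinement(nn.Module):
    """One assumed refinement level: guide features concatenated with the
    (upsampled) previous disparity map go in, a residual map comes out;
    the caller adds the residual to the disparity map."""

    def __init__(self, guide_channels: int, hidden_channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(guide_channels + 1, hidden_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_channels, hidden_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_channels, 1, 3, padding=1),  # 1-channel residual map
        )

    def forward(self, guide_and_disp: torch.Tensor) -> torch.Tensor:
        return self.net(guide_and_disp)
```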
  • In addition, the disparity generation network 300 may also be configured such that, in each level of disparity processing other than the first, before performing disparity optimization on the disparity map generated by the previous level, it up-samples that disparity map to the size corresponding to the current level in response to the size of that disparity map being smaller than the size corresponding to the current level.
  • The algorithm used for the up-sampling may include, for example, the nearest-neighbor interpolation algorithm, the bilinear interpolation algorithm, the deconvolution algorithm, and so on.
  • In other words, the disparity map targeted by each level of disparity optimization may be a disparity map having the size corresponding to that level. For example, in the second level of disparity processing shown in Fig. 5 (corresponding to 1/8 size), the size of the initial disparity map generated by the previous level is not smaller than the size corresponding to this level, so this level can directly perform disparity optimization on the initial disparity map based on part or all of the extracted 1/8-size edge features of the first image to obtain the 1/8-size first optimized disparity map. In the next level (corresponding to 1/4 size), the 1/8-size first optimized disparity map generated by the previous level can first be up-sampled to the 1/4 size corresponding to the current level, and the up-sampled 1/4-size first optimized disparity map can then be subjected to disparity optimization. For example, as mentioned above, a 1/4-size residual map can be calculated based on part or all of the extracted 1/4-size edge features of the first image and the up-sampled 1/4-size first optimized disparity map, and added to the up-sampled map to obtain the 1/4-size second optimized disparity map, and so on.
  • It should be noted that the image features on which different optimized disparity maps are generated may be of the same type or of different types, and/or may be image features of the same image or of different images in the image pair. For example, the image features on which two different optimized disparity maps are generated may be the same type of feature (such as edge features) of the same image (such as the first image); alternatively, they may be different types of features of the same image (for example, one being the edge features of the first image and the other being features based on the first image itself), and so on. By flexibly selecting the image features on which each level of disparity optimization is based, the flexibility and applicability of the disparity estimation system can be further improved.
  • In some embodiments, the image features on which each optimized disparity map is generated may include, for example, edge features of at least one image in the image pair and/or features based on the image itself of at least one image in the image pair.
  • For example, as shown in Fig. 5, in the disparity optimization processing of the smaller sizes corresponding to the earlier levels, the optimized disparity map of the corresponding size may be generated based on the edge features of the first image, while in the disparity optimization processing of the relatively large sizes corresponding to the latter two levels, features based on the first image itself may be used instead of edge features to optimize the disparity map generated by the previous level, so as to reduce the computation required for large-size feature extraction and improve the processing efficiency of the disparity estimation system.
  • Of course, Fig. 5 is only an example. In each level of disparity optimization, the image features on which the corresponding optimized disparity map is generated may be not only edge features or features based on the image itself, but also a combination of the two, or a combination with one or more other extracted image features, and so on.
  • The features based on the image itself of at least one image in the image pair may include, for example, the image itself, or an image obtained by down-sampling the image according to the size of the optimized disparity map to be generated. The down-sampling process may work as follows: for an image of size H×W with a down-sampling coefficient or ratio K, one point may be selected every K points in each row and each column of the original image to form the down-sampled image. K may be 2, 3, or another value greater than 1, as mentioned above. Of course, this is only an example; down-sampling can also be implemented in other ways, such as averaging over K points. Taking Fig. 5 as an example, the last two levels of disparity optimization may use, respectively, the 1/2-size first image obtained by down-sampling the first image with a coefficient of 2 and the first image itself, in place of edge features of the corresponding sizes, to optimize the disparity map generated by the respective previous level, reducing the computation required for large-size feature extraction and improving the processing efficiency of the disparity estimation system.
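  • A minimal illustration of the point-selection scheme just described, together with the averaging alternative mentioned above, assuming a single-channel image stored as a NumPy array:

```python
import numpy as np

K = 2                                    # down-sampling coefficient/ratio
img = np.arange(6 * 8).reshape(6, 8)     # a toy single-channel H x W "image"

# Select one point every K points in each row and column (as described above):
down_select = img[::K, ::K]              # shape (3, 4), i.e. H/K x W/K

# Alternative mentioned above: average over K x K blocks instead of selecting.
down_avg = img.reshape(6 // K, K, 8 // K, K).mean(axis=(1, 3))

print(down_select.shape, down_avg.shape)  # (3, 4) (3, 4)
```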
  • In some embodiments, as shown in Fig. 5 or Fig. 6, the disparity generation network 300 may include an initial disparity generation sub-network 301 and at least one disparity optimization sub-network 302, cascaded in sequence; the initial disparity generation sub-network 301 is configured to perform the first level of disparity processing, and the at least one disparity optimization sub-network 302 is configured to perform the levels of disparity processing other than the first. In other words, both the image feature extraction and each level of the multi-level disparity processing of the disparity estimation system 100 can be implemented by a corresponding sub-network.
  • The following takes Fig. 5 as an example to schematically illustrate the overall working process of a disparity estimation system 100 that includes multiple feature extraction sub-networks (the basic structure feature sub-network 201, the semantic feature sub-network 202, and the edge feature sub-network 203), an initial disparity generation sub-network 301, and multiple (for example, four) disparity optimization sub-networks 302.
  • After obtaining the image pair to be processed, the disparity estimation system 100 may use the multiple feature extraction sub-networks in the feature extraction network 200 to extract the image features of the sizes required by the subsequent multi-level disparity processing. For example, the basic structure feature sub-network 201 can extract the 1/8-size basic structure features of the first image I1 and of the second image I2, the semantic feature sub-network 202 can extract the 1/8-size semantic features of the first image I1, and the edge feature sub-network 203 can extract the 1/8-size and 1/4-size edge features of the first image I1. In addition, the feature extraction network 200 can also provide the 1/2-size features based on the first image I1 itself and the full-size (H×W) features based on the first image I1 itself, that is, the first image I1 itself.
  • The 1/8-size basic structure features, semantic features, and edge features can be output by the corresponding feature extraction sub-networks to the initial disparity generation sub-network 301, which performs the first level of disparity processing to obtain the 1/8-size initial disparity map dispS1.
  • Then, the four disparity optimization sub-networks 302 cascaded after the initial disparity generation sub-network 301 can, based on the image features of the corresponding sizes extracted by the feature extraction network 200, sequentially perform different levels of disparity optimization on the 1/8-size initial disparity map dispS1 to obtain multiple optimized disparity maps with successively increasing sizes. Specifically, the first disparity optimization sub-network can perform disparity optimization on the 1/8-size initial disparity map dispS1 output by the initial disparity generation sub-network 301, based on the 1/8-size edge features (part or all) of the first image I1 from the edge feature sub-network 203, to obtain the 1/8-size first optimized disparity map dispS1_refine.
  • For example, the first disparity optimization sub-network may calculate a 1/8-size first residual map based on the 1/8-size edge features (part or all) of the first image I1 and the 1/8-size initial disparity map dispS1, and add the first residual map to the 1/8-size initial disparity map dispS1 to obtain the 1/8-size first optimized disparity map dispS1_refine.
  • The second disparity optimization sub-network can, based on the 1/4-size edge features (part or all) of the first image I1 from the edge feature sub-network 203, perform disparity optimization on the 1/8-size first optimized disparity map dispS1_refine output by the first disparity optimization sub-network to obtain the 1/4-size second optimized disparity map dispS2_refine.
  • For example, the second disparity optimization sub-network may up-sample the 1/8-size first optimized disparity map output by the first disparity optimization sub-network to the 1/4 size corresponding to the current level, calculate a 1/4-size second residual map based on the 1/4-size edge features (part or all) of the first image I1 and the up-sampled 1/4-size first optimized disparity map, and add the second residual map to the up-sampled map to obtain the 1/4-size second optimized disparity map dispS2_refine.
  • The third disparity optimization sub-network can, based on the 1/2-size features (part or all) based on the first image I1 itself extracted by the feature extraction network 200, perform disparity optimization on the 1/4-size second optimized disparity map dispS2_refine output by the second disparity optimization sub-network to obtain the 1/2-size third optimized disparity map dispS3_refine. For example, it may up-sample the 1/4-size second optimized disparity map to the 1/2 size corresponding to the current level, calculate a 1/2-size third residual map based on the 1/2-size features (part or all) based on the first image itself and the up-sampled 1/2-size second optimized disparity map, and add the third residual map to the up-sampled map to obtain the 1/2-size third optimized disparity map dispS3_refine.
  • The fourth disparity optimization sub-network can, based on the full-size features (part or all) based on the first image I1 itself extracted by the feature extraction network 200, perform disparity optimization on the 1/2-size third optimized disparity map dispS3_refine output by the third disparity optimization sub-network to obtain the full-size fourth optimized disparity map dispS4_refine. For example, it may up-sample the 1/2-size third optimized disparity map to the full size corresponding to the current level, calculate a full-size fourth residual map based on the full-size features (part or all) based on the first image itself and the up-sampled full-size third optimized disparity map, and add the fourth residual map to the up-sampled map to obtain the full-size fourth optimized disparity map dispS4_refine.
  • It should be noted that, in order to reduce the amount of computation, the third and fourth disparity optimization sub-networks use features based on the first image itself for disparity optimization; however, one or both of them could also use edge features or other features of the first image. Similarly, if the amount of computation needs to be reduced further, the first and/or second disparity optimization sub-network could likewise use features based on the first image itself in place of the extracted edge features, and so on.
  • The 1/8-size first optimized disparity map dispS1_refine, the 1/4-size second optimized disparity map dispS2_refine, the 1/2-size third optimized disparity map dispS3_refine, and the full-size fourth optimized disparity map dispS4_refine can then serve as the multiple disparity maps with successively increasing sizes obtained by the disparity estimation system 100 shown in Fig. 5.
  • The overall working process of the disparity estimation system 100 shown in Fig. 6 is similar to that of the system shown in Fig. 5, except that the size of the initial disparity map generated by the initial disparity generation sub-network 301 is smaller than the size of the optimized disparity map generated by the first disparity optimization sub-network, and the initial disparity map itself can serve as one of the multiple disparity maps with successively increasing sizes obtained by the system; the details are not repeated here.
  • In some embodiments, each of the initial disparity generation sub-network 301 and the at least one disparity optimization sub-network 302 may be a convolutional neural network capable of realizing the corresponding disparity processing function, such as a 2DCNN (two-dimensional deep convolutional neural network) or a 3DCNN (three-dimensional deep convolutional neural network). Using a convolutional neural network as the disparity processing sub-network yields a larger receptive field, improving the accuracy of the disparity maps obtained by the disparity estimation system.
  • In some embodiments, when the initial disparity generation sub-network 301 adopts a 2DCNN structure, it may include a first number (for example, 5; the value can of course be selected flexibly according to actual needs) of successively cascaded convolutional layers. The convolution method of each convolutional layer may be, for example, depthwise separable convolution.
  • For example, Table 1 below schematically describes an initial disparity generation sub-network 301 with a 2DCNN structure that can be applied to the disparity estimation system shown in Fig. 5 and that includes 5 sequentially cascaded convolutional layers (conv1-conv5 in Table 1); as an example, this sub-network uses the MobileNetV2 network architecture.
  • Table 1: a possible 2DCNN network structure of the initial disparity generation sub-network 301
  • In Table 1, the corr1d layer can be used to perform correlation operations on the 1/8-size basic structure features of the first image and of the second image extracted by the feature extraction network 200 in Fig. 5.
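  • Table 1 does not define corr1d beyond this; in stereo networks such a layer is commonly a 1D correlation over candidate disparities, sketched below under that assumption (the name corr1d and the 1/8-size inputs come from Table 1's context; the implementation itself is not from the patent):

```python
import torch

def corr1d(feat_left: torch.Tensor, feat_right: torch.Tensor,
           max_disp: int) -> torch.Tensor:
    """Hypothetical 1D correlation: for each candidate disparity d, correlate
    the left feature map with the right feature map shifted d pixels.

    feat_left, feat_right: (N, C, H, W) feature maps at 1/8 resolution.
    Returns an (N, max_disp, H, W) correlation volume.
    """
    n, c, h, w = feat_left.shape
    volume = feat_left.new_zeros(n, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_left * feat_right).mean(dim=1)
        else:
            # Pixel x in the left image matches pixel x - d in the right image.
            volume[:, d, :, d:] = (feat_left[:, :, :, d:] *
                                   feat_right[:, :, :, :-d]).mean(dim=1)
    return volume
```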
  • The semanS1_conv layer can be used to perform convolution processing on the extracted 1/8-size semantic features of the first image based on a 3×3 convolution kernel.
  • The edgeS1_conv layer can be used to perform convolution processing on the extracted 1/8-size edge features of the first image based on a 3×3 convolution kernel.
  • The concat layer can be used to merge the features output by corr1d, semanS1_conv, and edgeS1_conv.
  • The MB_conv operation involved in the conv1-conv5 layers refers to the depthwise separable convolution operation in MobileNetV2, and the MB_conv_res operation refers to the residual depthwise separable convolution operation in MobileNetV2.
  • the conv1 layer, conv2 layer, and conv4 layer can be used to perform depth-separable convolution operations on the output features of the previous layer
  • the conv3 layer and conv5 layer can be used to perform residual depth-separable convolution on the features output by the previous layer.
  • the dispS1 layer can be used to perform a soft argmin calculation on the features output by the previous layer to obtain the initial disparity map dispS1 of the corresponding size (ie, 1/8 size).
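The soft argmin computation performed by the dispS1 layer can be illustrated as follows; this is the standard differentiable disparity regression, written as a minimal PyTorch sketch (the exact variant used in the disclosure is not specified):

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost):
    """Differentiable disparity regression over a cost volume.
    cost: (B, D, H, W), where cost[:, d] is the matching cost at disparity d.
    A softmax over the negated costs yields a probability per disparity,
    and the expected disparity is returned as a (B, 1, H, W) map."""
    prob = F.softmax(-cost, dim=1)
    disp_values = torch.arange(cost.size(1), dtype=cost.dtype,
                               device=cost.device).view(1, -1, 1, 1)
    return (prob * disp_values).sum(dim=1, keepdim=True)
```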
  • H and W mentioned in Table 1 can respectively represent the height and width of the images in the image pair input to the disparity estimation system 100, and D can represent the maximum disparity range of the images; the units of all three can be pixels. The value of D may be related to the focal length of each camera in the multi-camera device used to capture the image pair and/or the spacing between the cameras.
  • The number of convolutional layers of the initial disparity generation sub-network 301 using the 2DCNN structure may be determined according to the number of features obtained by the concat layer. For example, if the number of features obtained by the concat layer is large, the number of convolutional layers included in the initial disparity generation sub-network 301 can be increased accordingly.
  • As an alternative, the initial disparity generation sub-network 301 can also adopt a 3DCNN structure to obtain disparity; in that case it may include a second number (for example, 7; the value can be flexibly selected according to actual needs) of sequentially cascaded convolutional layers.
  • For example, Table 2 is used below to schematically describe an initial disparity generation sub-network 301 of a 3DCNN structure that can be applied to the disparity estimation system shown in FIG. 5 and includes 7 sequentially cascaded convolutional layers (conv1-conv7 in Table 2).
  • Table 2: Description of a possible 3DCNN network structure of the initial disparity generation sub-network 301 (the table body is rendered as images in the original publication and is not reproduced here)
  • The edgeS1_conv layer can be used to perform convolution processing on the extracted 1/8-size edge features of the first image based on a 3×3 convolution kernel, and the semanS3_conv layer can be used to perform convolution processing on the extracted 1/8-size semantic features of the first image based on a 3×3 convolution kernel. The concat layer can be used to merge the features output by featS1, semanS1_conv, and edgeS1_conv, where featS1 (not shown in Table 2) can refer to the extracted 1/8-size basic structure features of the first image and the 1/8-size basic structure features of the second image. The cost layer can be used to perform a shift (translation) operation on the features output by the concat layer.
  • The conv1 to conv7 layers can each be used to perform convolution operations on the features output by the previous layer based on a 3×3×3 convolution kernel, where the conv2, conv4, and conv6 layers can be regarded as the residual modules of the 3DCNN network: after performing the convolution operation on the features output by the previous layer, they add the convolution result to the output of the previous layer. The dispS1 layer can be used to perform a soft argmin calculation on the features output by the previous layer to obtain an initial disparity map dispS1 of the corresponding size (i.e., 1/8 size).
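As an illustration of how the cost layer's shift operation can produce an input volume for the 3×3×3 convolutions described above, here is a minimal sketch of one common cost-volume construction; the disclosure does not detail the exact construction, so treat this as an assumption:

```python
import torch

def build_cost_volume(feat, max_disp):
    """Builds a 5D tensor for 3D convolution by translating the merged
    feature map across candidate disparities.
    feat: (B, C, H, W) -> (B, C, max_disp, H, W)."""
    b, c, h, w = feat.shape
    volume = feat.new_zeros(b, c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = feat
        else:
            volume[:, :, d, :, d:] = feat[..., :-d]  # shift right by d pixels
    return volume
```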
  • H and W mentioned in Table 2 may respectively represent the height and width of the images in the image pair input to the disparity estimation system 100. F can indicate the number of feature channels; 1F indicates that the number of channels is F, and 3F indicates that the number of channels is 3×F, and so on.
  • The number of convolutional layers of the initial disparity generation sub-network 301 using the 3DCNN structure can also be determined according to the number of features obtained by the concat layer. For example, if the number of features obtained by the concat layer is large, the number of convolutional layers included in the initial disparity generation sub-network 301 can be increased accordingly.
  • The number of convolutional layers included in each disparity optimization sub-network 302 of the at least one disparity optimization sub-network 302 may be less than the number of convolutional layers included in the initial disparity generation sub-network 301. For example, taking each disparity optimization sub-network 302 with a 2DCNN structure as an example, the number of convolutional layers included in each disparity optimization sub-network 302 can be three; of course, it can also be set to other values according to actual needs. Each disparity optimization sub-network 302 may also adopt a 3DCNN structure, which is not limited here.
  • Table 3: Description of a possible 2DCNN network structure of the first disparity optimization sub-network (the table body is rendered as images in the original publication and is not reproduced here)
  • Similar to Table 1, the edgeS1_conv layer can be used to perform convolution processing on the extracted 1/8-size edge features of the first image based on a 3×3 convolution kernel. The concat layer can be used to merge the 1/8-size initial disparity map dispS1 generated by the previous level of disparity processing (i.e., the initial disparity generation processing) with the features output by edgeS1_conv.
  • The conv1 and conv3 layers can each be used to perform a depthwise separable convolution operation on the features output by the previous layer, and the conv2 layer can be used to perform a residual depthwise separable convolution operation on the features output by the previous layer. The dispS1_refine layer can be used to superimpose the features output by the previous layer conv3 and the 1/8-size initial disparity map dispS1 generated by the previous level of disparity processing, obtaining the first optimized disparity map dispS1_refine of the corresponding size (i.e., 1/8 size).
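The refinement pattern of Table 3 (concatenate the disparity map with a guidance feature, apply a few convolutions that predict a residual, and add the residual back onto the disparity) can be sketched as follows; the plain Conv2d layers standing in for MB_conv/MB_conv_res and the channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DisparityRefine(nn.Module):
    """Minimal residual refinement block: concat -> convs -> residual add."""
    def __init__(self, guide_ch, hidden=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1 + guide_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, disp, guide):
        residual = self.body(torch.cat([disp, guide], dim=1))  # concat layer
        return disp + residual                                 # dispS1_refine layer
```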
  • Table 4: Description of a possible 2DCNN network structure of the second disparity optimization sub-network (the table body is rendered as images in the original publication and is not reproduced here)
  • The dispS1_up layer can be used to up-sample the 1/8-size first optimized disparity map dispS1_refine generated by the previous level of disparity processing (i.e., the first level of disparity optimization processing) to obtain a 1/4-size optimized disparity map dispS1_up. The edgeS2_conv layer can be used to perform convolution processing on the extracted 1/4-size edge features of the first image based on a 3×3 convolution kernel. The concat layer can be used to merge the up-sampled 1/4-size optimized disparity map dispS1_up with the features output by edgeS2_conv.
  • The conv1 and conv3 layers can each be used to perform a depthwise separable convolution operation on the features output by the previous layer, and the conv2 layer can be used to perform a residual depthwise separable convolution operation on the features output by the previous layer. The dispS2_refine layer can be used to superimpose the features output by the previous layer conv3 and the up-sampled 1/4-size optimized disparity map dispS1_up, obtaining the second optimized disparity map dispS2_refine of the corresponding size (i.e., 1/4 size).
  • Table 5: Description of a possible 2DCNN network structure of the third disparity optimization sub-network (the table body is rendered as images in the original publication and is not reproduced here)
  • The dispS2_up layer can be used to up-sample the 1/4-size second optimized disparity map dispS2_refine generated by the previous level of disparity processing (i.e., the second level of disparity optimization processing) to obtain a 1/2-size optimized disparity map dispS2_up. The imgS3 layer can be used to down-sample the first image itself to obtain 1/2-size features based on the first image itself, where I1 in Table 5 represents the first image. The concat layer can be used to merge the up-sampled 1/2-size optimized disparity map dispS2_up with the features output by imgS3.
  • The conv1, conv2, and conv3 layers can each be used to convolve the features output by the previous layer, and the dispS3_refine layer can be used to superimpose the features output by the previous layer conv3 and the up-sampled 1/2-size optimized disparity map dispS2_up, obtaining the third optimized disparity map dispS3_refine of the corresponding size (i.e., 1/2 size).
  • Table 6: Description of a possible 2DCNN network structure of the fourth disparity optimization sub-network (the table body is rendered as images in the original publication and is not reproduced here)
  • The dispS3_up layer can be used to up-sample the 1/2-size third optimized disparity map dispS3_refine generated by the previous level of disparity processing (i.e., the third level of disparity optimization processing) to obtain a full-size optimized disparity map dispS3_up. The concat layer can be used to merge the up-sampled full-size optimized disparity map dispS3_up with the first image itself, where I1 in Table 6 represents the first image.
  • The conv1, conv2, and conv3 layers can each be used to convolve the features output by the previous layer, and the dispS4_refine layer can be used to superimpose the features output by the previous layer conv3 and the up-sampled full-size optimized disparity map dispS3_up, obtaining the fourth optimized disparity map dispS4_refine of the corresponding size (i.e., full size).
  • H and W mentioned in Tables 3 to 6 may respectively represent the height and width of the images in the image pair input to the disparity estimation system 100.
  • The number of convolutional layers of each disparity optimization sub-network 302 adopting the 2DCNN structure may also be determined according to the number of features obtained by the concat layer. For example, if the number of features obtained by the concat layer is large, the number of convolutional layers included in each disparity optimization sub-network 302 can be increased accordingly.
  • Each of the initial disparity generation sub-network 301 and the at least one disparity optimization sub-network 302 may be a network pre-trained based on a training sample set, so that the efficiency of disparity processing can be improved. Each sub-network may also be obtained by real-time training based on a training sample set, or by real-time or scheduled optimization of a pre-trained network based on an updated training sample set, so as to improve the accuracy of disparity generation.
  • The training process of each of the initial disparity generation sub-network 301 and the at least one disparity optimization sub-network 302 can adopt supervised training or unsupervised training, flexibly selected according to actual needs. For supervised training and unsupervised training, reference may be made to the relevant descriptions in the foregoing embodiments, which are not repeated here.
  • Each of the initial disparity generation sub-network 301 and the at least one disparity optimization sub-network 302 may also be configured to calculate a loss function, which can be used to represent the error between the disparity in the disparity map generated by the sub-network and the corresponding true disparity. In this way, by calculating the loss functions, the accuracy of each disparity map generated by the disparity estimation system can be made explicit. In addition, the corresponding system can also be optimized based on the loss function.
  • Taking the case where each of the initial disparity generation sub-network 301 and the at least one disparity optimization sub-network 302 adopts supervised training as an example, the loss function output by each disparity processing sub-network or each level of disparity processing can be defined as L_n = f(Disp_GTn - Disp_Sn) + g(Disp_Sn). The value of n ranges from 1 to N (for the disparity estimation system shown in FIG. 5) or from 0 to N (for the disparity estimation system shown in FIG. 6). The function f represents the difference between the predicted disparity (Disp_Sn) and the true disparity (Disp_GTn), and g represents a disparity continuity constraint, for example g(x) = |x_x| + |x_y| (with the subscripts denoting gradients of the disparity map x along the two image axes). In addition, the edge features can also be considered as a regularization term of the loss function, and there is no restriction on this.
  • Correspondingly, the final loss function of the disparity estimation system 100 may be the sum of the loss functions output by all disparity processing sub-networks or all levels of disparity processing.
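A hedged instantiation of the per-level loss L_n and of the final summed loss might look like the sketch below; the disclosure does not fix the form of f, so the smooth L1 data term is an assumption, and the smoothness term follows the g(x) = |x_x| + |x_y| reading above:

```python
import torch.nn.functional as F

def level_loss(disp_pred, disp_gt):
    """One plausible instantiation of L_n = f(Disp_GTn - Disp_Sn) + g(Disp_Sn)."""
    data_term = F.smooth_l1_loss(disp_pred, disp_gt)           # f: prediction error
    grad_x = (disp_pred[..., :, 1:] - disp_pred[..., :, :-1]).abs().mean()
    grad_y = (disp_pred[..., 1:, :] - disp_pred[..., :-1, :]).abs().mean()
    return data_term + grad_x + grad_y                         # g: continuity

def total_loss(preds, gts):
    """Final loss of the system: sum of the losses over all levels."""
    return sum(level_loss(p, g) for p, g in zip(preds, gts))
```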
  • When each of the initial disparity generation sub-network 301 and the at least one disparity optimization sub-network 302 adopts unsupervised training, the loss function can be obtained by reconstructing the image and calculating the reconstruction error; for example, the second image I2 can be warped into a reconstruction of the first image according to the disparity computed by a given level, and the loss measures the difference between the first image and that reconstruction.
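For the unsupervised case, the reconstruction can be illustrated with a standard horizontal warp via grid sampling; this is a minimal sketch assuming the left-disparity convention (a pixel at column x in the left image corresponds to column x - d in the right image):

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_right, disp_left):
    """Reconstructs the left image by sampling the right image at
    disparity-shifted locations (warpI1 = warp(I2, Disp1)).
    img_right: (B, C, H, W); disp_left: (B, 1, H, W) in pixels."""
    b, _, h, w = img_right.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img_right.device),
                            torch.arange(w, device=img_right.device),
                            indexing="ij")
    xs = xs.unsqueeze(0).float() - disp_left.squeeze(1)  # shift by disparity
    ys = ys.unsqueeze(0).float().expand_as(xs)
    # normalize sampling coordinates to [-1, 1] for grid_sample
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
    return F.grid_sample(img_right, grid, align_corners=True)

# reconstruction_error = (img_left - warp_right_to_left(img_right, disp)).abs().mean()
```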
  • The following takes as an example the case where each of the initial disparity generation sub-network 301 and the at least one disparity optimization sub-network 302 adopts supervised training, the training set uses Scene Flow, and the structure of the disparity estimation system is as shown in FIG. 5, and illustrates, with reference to FIG. 7A, FIG. 7B, and FIG. 8, the reference image on which training is based, the corresponding ground-truth disparity map, and the test results obtained by applying the trained parameters to Middlebury dataset images.
  • FIG. 7A and FIG. 7B respectively show schematic diagrams of a reference image and of the corresponding ground-truth disparity map on which network training is based, according to an exemplary embodiment of the present disclosure. FIG. 8 shows a schematic diagram of multiple disparity maps with sizes increasing from right to left, obtained by using the trained disparity estimation system to perform cascaded multi-level disparity processing on the reference image shown in FIG. 7A (i.e., the result of testing on a Middlebury dataset picture with the parameters obtained after training). It can be seen from these drawings that the sizes of the obtained disparity maps increase successively, their accuracy also increases successively, and the accuracy of the largest-size disparity map is close to that of the ground-truth disparity map.
  • Although FIG. 7A, FIG. 7B, and FIG. 8 respectively illustrate the reference image, the ground-truth disparity map, and the generated disparity maps in the form of grayscale images, it can be understood that when the reference image shown in FIG. 7A is a color image, the disparity maps shown in FIG. 7B and FIG. 8 may also be corresponding color images.
  • The disparity generation network 300 may also be configured to select, according to the performance of the target device, a disparity map whose size matches that performance from the multiple disparity maps, as the disparity map provided to the target device. For example, if the performance of the target device is high and/or the accuracy of the required disparity map is high, a disparity map with a larger size may be selected from the multiple disparity maps and provided to the target device.
  • Alternatively, the target device can also actively obtain the disparity map it needs from the multiple disparity maps obtained by the disparity estimation system according to its own performance, and there is no limitation on this.
  • The multiple disparity maps obtained by the disparity estimation system may also be provided to the corresponding target device for further processing, for example so that the target device computes a depth map based on the disparity map and then obtains the depth information of the scene, which can be applied to various application scenarios such as three-dimensional reconstruction, autonomous driving, and obstacle detection.
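The disparity-to-depth computation mentioned here follows the standard rectified-stereo relation; the numeric values in the comment are purely illustrative:

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Standard relation for rectified stereo: depth = f * B / d,
    with focal length f in pixels, baseline B in meters, disparity d in pixels."""
    return focal_px * baseline_m / disparity_px

# e.g. f = 700 px, B = 0.12 m, d = 21 px -> depth = 4.0 m (illustrative values)
```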
  • FIG. 9 shows a flowchart of a disparity estimation method according to an exemplary embodiment of the present disclosure.
  • As shown in FIG. 9, the disparity estimation method of the present disclosure may include the following steps: performing feature extraction on each image in the image pair (step S901); and performing cascaded multi-level disparity processing according to the extracted image features to obtain multiple disparity maps with successively increasing sizes (step S902), wherein the input of the first level of disparity processing in the multi-level disparity processing includes multiple image features having a size corresponding to that level of disparity processing, and the input of each level of disparity processing other than the first level includes: one or more image features having a size corresponding to that level of disparity processing, and the disparity map generated by the previous level of disparity processing.
  • The image pair may be an image pair for the same scene captured by a multi-camera device.
  • the sizes of the images in the image pair are the same, but the corresponding viewing angles are different.
  • each image in the image pair may be a grayscale image or a color image.
  • The extracted image features of each image in the image pair may include at least one or more of the following features: basic structure features, semantic features, edge features, texture features, color features, object shape features, or features based on the image itself.
  • For example, the extracted image features of the first image (such as the left-eye image) in the image pair may include basic structure features, semantic features, and edge features, and the extracted image features of the second image (such as the right-eye image) in the image pair may include basic structure features. Alternatively, the extracted image features of both the first image and the second image in the image pair may include basic structure features, semantic features, edge features, and so on.
  • The disparity map with the largest size among the multiple disparity maps may be consistent with the size of each image in the image pair. Alternatively, the size of each of the multiple disparity maps may also all be smaller than the size of each image in the image pair.
  • The height and width of the latter disparity map in any two adjacent disparity maps among the multiple disparity maps may be respectively twice the height and width of the former; of course, they can also be set to 3 times, 4 times, or another positive-integer multiple greater than 1, according to the actually required accuracy.
  • Taking the case where the size of the last disparity map among the multiple disparity maps is H×W (which can be the same as the size of each image in the image pair) as an example, the sizes of the other disparity maps before it can be, in order, (H/2)×(W/2) (if H×W is called full size, this can be called 1/2 size), (H/4)×(W/4) (which can be called 1/4 size), and (H/8)×(W/8) (which can be called 1/8 size).
  • The extracted image features may include image features of N sizes, where N is a positive integer not less than 2.
  • FIG. 10 shows a flowchart of the multi-level disparity processing in the disparity estimation method according to an exemplary embodiment of the present disclosure. As shown in FIG. 10, performing cascaded multi-level disparity processing according to the extracted image features to obtain multiple disparity maps with successively increasing sizes may include the following steps.
  • Step S1001: In the first level of the multi-level disparity processing, generate an initial disparity map with the smallest size according to at least a part of the smallest-size image features among the image features of the N sizes.
  • For example, taking the case where the extracted image features of the N sizes include image features of 1/8 size, 1/4 size, 1/2 size, and full size, in the first level of the multi-level disparity processing, an initial disparity map with the smallest size (i.e., 1/8 size) can be generated according to at least a part of the smallest-size (i.e., 1/8-size) image features among the four sizes. For example, the correspondingly sized image features can be shifted over the candidate disparities and stacked, and a 3DCNN used to obtain the initial disparity map; or the difference between the correspondingly sized image features after the shift can be calculated, and a 2DCNN used to obtain the initial disparity map.
  • Step S1002: In each subsequent level of the multi-level disparity processing, perform disparity optimization processing on the disparity map generated by the previous level of disparity processing according to at least a part of the correspondingly sized image features among the image features of the N sizes, to generate an optimized disparity map with the corresponding size, wherein the multiple disparity maps include at least each optimized disparity map.
  • According to some embodiments, the multi-level disparity processing may include N+1 levels of disparity processing.
  • Correspondingly, performing disparity optimization processing on the disparity map generated by the previous level of disparity processing to generate an optimized disparity map with the corresponding size, where the multiple disparity maps include at least each optimized disparity map, may include: in the N levels of disparity processing other than the first level, in order of size from small to large, performing disparity optimization processing on the disparity map generated by the previous level based on at least a part of the correspondingly sized image features among the N sizes of image features, obtaining N optimized disparity maps with successively increasing sizes, and using the N optimized disparity maps as the multiple disparity maps, wherein the sizes of the N optimized disparity maps correspond to the N sizes.
  • For example, take the case where the extracted image features of the N sizes include image features of 1/8 size, 1/4 size, 1/2 size, and full size, and the multi-level disparity processing includes 4+1 levels of disparity processing. In the 4 levels of disparity processing other than the first level, in order of size from small to large, disparity optimization processing can be performed on the disparity map generated by the previous level based successively on at least a part of the correspondingly sized image features among the four sizes, obtaining 4 optimized disparity maps with successively increasing sizes (for example, an optimized disparity map of 1/8 size, an optimized disparity map of 1/4 size, an optimized disparity map of 1/2 size, and an optimized disparity map of full size), and the four optimized disparity maps are used as the multiple disparity maps.
  • According to other embodiments, the multi-level disparity processing may include N levels of disparity processing.
  • Correspondingly, performing disparity optimization processing on the disparity map generated by the previous level of disparity processing to generate an optimized disparity map with the corresponding size, where the multiple disparity maps include at least each optimized disparity map, may include: in the N-1 levels of disparity processing other than the first level, in order of size from small to large, performing disparity optimization processing on the disparity map generated by the previous level based on at least a part of the correspondingly sized image features among the N-1 non-smallest sizes of the N sizes of image features, obtaining N-1 optimized disparity maps with successively increasing sizes, and using the initial disparity map and the N-1 optimized disparity maps as the multiple disparity maps.
  • For example, take the case where the extracted image features of the N sizes include image features of 1/8 size, 1/4 size, 1/2 size, and full size, and the multi-level disparity processing includes 4 levels of disparity processing. In the 3 levels of disparity processing other than the first level, in order of size from small to large, disparity optimization processing can be performed on the disparity map generated by the previous level based on at least a part of the correspondingly sized image features among the other three non-smallest sizes, obtaining 3 optimized disparity maps with successively increasing sizes (for example, an optimized disparity map of 1/4 size, an optimized disparity map of 1/2 size, and an optimized disparity map of full size), and the initial disparity map and the three optimized disparity maps are used as the multiple disparity maps.
  • In other words, the obtained multiple disparity maps may or may not include the initial disparity map generated by the first level of the multi-level disparity processing, so as to improve the flexibility of disparity generation.
  • Performing disparity optimization processing on the disparity map generated by the previous level of disparity processing to generate an optimized disparity map with the corresponding size may include: performing residual computation processing on the disparity map generated by the previous level to obtain a residual map with the corresponding size, and combining the residual map with the corresponding size and the disparity map generated by the previous level to obtain an optimized disparity map with the corresponding size.
  • For example, take the case where the extracted image features of the N sizes include image features of 1/8 size, 1/4 size, 1/2 size, and full size, and the multi-level disparity processing includes 4+1 levels of disparity processing. In the level of disparity processing corresponding to 1/8 size among the 4 levels other than the first level (i.e., the disparity optimization processing corresponding to 1/8 size), a 1/8-size first residual map can be computed based on part or all of the extracted 1/8-size image features and the initial disparity map generated by the previous level, and a 1/8-size first optimized disparity map can be computed based on the first residual map and the initial disparity map. In the next level, a 1/4-size second residual map can be computed based on part or all of the extracted 1/4-size image features and the first optimized disparity map generated by the previous level, and a 1/4-size second optimized disparity map can be computed based on the second residual map and the first optimized disparity map, and so on.
  • The method may further include: in each level of disparity processing other than the first level of the multi-level disparity processing, before performing disparity optimization processing on the disparity map generated by the previous level, in response to the size of the disparity map generated by the previous level being smaller than the size corresponding to the current level of disparity processing, up-sampling the disparity map generated by the previous level to the size corresponding to the current level of disparity processing.
  • For example, take the case where the extracted image features of the N sizes include image features of 1/8 size, 1/4 size, 1/2 size, and full size, and the multi-level disparity processing includes 4+1 levels of disparity processing. In the level corresponding to 1/4 size, the 1/8-size first optimized disparity map generated by the previous level can first be up-sampled to the 1/4 size corresponding to the current level, and then disparity optimization processing can be performed on the up-sampled 1/4-size first optimized disparity map based on part or all of the extracted 1/4-size image features, obtaining a second optimized disparity map of 1/4 size.
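A minimal sketch of the up-sampling step just described is given below; note that multiplying the disparity values by the scale factor, so that they stay expressed in pixels of the new resolution, is a common convention rather than something mandated by the text:

```python
import torch.nn.functional as F

def upsample_disparity(disp, scale=2):
    """Bilinear up-sampling of a (B, 1, H, W) disparity map by `scale`."""
    up = F.interpolate(disp, scale_factor=scale, mode="bilinear",
                       align_corners=True)
    return up * scale  # rescale disparity values to the new pixel grid
```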
  • the smallest size image feature among the N sizes of image features may include, for example, at least one image feature of the first image and at least one image feature of the second image in the image pair.
  • For example, the smallest-size image features among the image features of the N sizes may include the basic structure features, semantic features, and edge features of the first image (for example, the left-eye image) in the image pair, and the basic structure features of the second image (for example, the right-eye image) in the image pair.
  • Each non-smallest-size image feature among the N sizes of image features may include, for example, at least one image feature of the first image and/or at least one image feature of the second image in the image pair. For example, each non-smallest-size image feature may include an edge feature of the first image in the image pair or a feature based on the first image itself.
  • The image features on which different optimized disparity maps are based may be the same kind of image feature or different kinds of image features; and/or, the image features on which different optimized disparity maps are based may be image features of the same image or of different images in the image pair.
  • The image features on which each optimized disparity map is based may include, for example, the edge features of at least one image in the image pair, and/or features based on the image itself of at least one image in the image pair.
  • The features based on the image itself of at least one image in the image pair may include, for example, the image itself of the at least one image, or an image obtained by down-sampling the image itself of the at least one image according to the size of the optimized disparity map that needs to be generated.
  • The disparity estimation method may further include: calculating the loss function of each level of disparity processing in the multi-level disparity processing, where the loss function can represent the error between the disparity in the disparity map generated by that level of disparity processing and the corresponding true disparity. In this way, by calculating the loss functions, the accuracy of each generated disparity map can be made explicit, and the disparity estimation method can also be optimized based on the loss function.
  • The disparity estimation method may further include: selecting, according to the performance of the target device, a disparity map whose size matches that performance from the multiple disparity maps as the disparity map provided to the target device. For example, if the performance of the target device is high and/or the accuracy of the required disparity map is high, a disparity map with a larger size may be selected from the multiple disparity maps and provided to the target device. Alternatively, the target device may also actively obtain the disparity map it needs from the multiple disparity maps obtained by the disparity estimation system according to its own performance.
  • The disparity estimation method may further include: before performing image feature extraction on each image in the image pair, performing epipolar rectification on the images in the image pair, so that the images in the image pair have disparity in only one direction (for example, the horizontal direction). The disparity search range of the images can thus be limited to one direction, thereby improving the efficiency of subsequent feature extraction and disparity generation.
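Epipolar rectification is typically performed with calibrated stereo parameters; below is a hedged OpenCV-Python sketch (K1, d1, K2, d2, R, T are placeholder calibration inputs, and this is one standard way to rectify, not necessarily the procedure used by the disclosure):

```python
import cv2

def rectify_pair(img_l, img_r, K1, d1, K2, d2, R, T):
    """Rectifies a calibrated stereo pair so that matching points lie on
    the same image row, limiting the disparity search to one direction."""
    h, w = img_l.shape[:2]
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, (w, h), R, T)
    map1l, map2l = cv2.initUndistortRectifyMap(K1, d1, R1, P1, (w, h), cv2.CV_32FC1)
    map1r, map2r = cv2.initUndistortRectifyMap(K2, d2, R2, P2, (w, h), cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, map1l, map2l, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map1r, map2r, cv2.INTER_LINEAR)
    return rect_l, rect_r
```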
  • An aspect of the present disclosure may include an electronic device, which may include a processor and a memory storing a program, the program including instructions that, when executed by the processor, cause the processor to execute any of the foregoing methods.
  • An aspect of the present disclosure may include a computer-readable storage medium storing a program, the program including instructions that, when executed by a processor of an electronic device, cause the electronic device to perform any of the foregoing methods.
  • The computing device 2000 can be any machine configured to perform processing and/or calculations, and can be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a smart phone, a vehicle-mounted computer, or any combination of these.
  • The above-mentioned electronic device may be implemented in whole or at least in part by the computing device 2000 or a similar device or system.
  • The computing device 2000 may include a bus 2002, one or more processors 2004, one or more input devices 2006, and one or more output devices 2008. Elements of the computing device 2000 may be connected to or in communication with the bus 2002 (possibly via one or more interfaces).
  • the one or more processors 2004 may be any type of processor, and may include, but are not limited to, one or more general-purpose processors and/or one or more special-purpose processors (for example, special processing chips).
  • the input device 2006 may be any type of device that can input information to the computing device 2000, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote control.
  • the output device 2008 may be any type of device that can present information, and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer.
  • the computing device 2000 may also include a storage device 2010 or be connected to the storage device 2010.
  • The storage device may be any non-transitory storage device capable of storing data, and may include, but is not limited to, a magnetic disk drive, an optical storage device, a solid-state memory, and a floppy disk.
  • The storage device 2010 may be detachable from an interface.
  • the storage device 2010 may have data/programs (including instructions)/code for implementing the above-mentioned methods and steps.
  • the computing device 2000 may also include a communication device 2012.
  • The communication device 2012 may be any type of device or system that enables communication with external devices and/or with a network, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
  • The computing device 2000 may also include a working memory 2014, which may be any type of working memory that can store programs (including instructions) and/or data useful for the work of the processor 2004, and may include, but is not limited to, random-access memory and/or read-only memory devices.
  • Software elements (programs) may be located in the working memory 2014, including but not limited to an operating system 2016, one or more applications (i.e., application programs) 2018, drivers, and/or other data and code. Instructions for performing the above-mentioned methods and steps may be included in the one or more applications 2018, and the feature extraction network 200 and the disparity generation network 300 of the above-mentioned disparity estimation system 100 can each be implemented by the processor 2004 reading and executing the instructions of the one or more applications 2018. More specifically, the feature extraction network 200 of the aforementioned disparity estimation system 100 can be implemented, for example, by the processor 2004 executing an application 2018 having instructions for executing step S901, and the disparity generation network 300 of the aforementioned disparity estimation system 100 can be implemented, for example, by the processor 2004 executing an application 2018 having instructions for executing step S902, and so on.
  • The executable code or source code of the instructions of the software elements (programs) can be stored in a non-transitory computer-readable storage medium (such as the aforementioned storage device 2010) and, when executed, may be stored in the working memory 2014 (and possibly compiled and/or installed).
  • the executable code or source code of the instructions of the software element (program) can also be downloaded from a remote location.
  • the client can receive data input by the user and send the data to the server.
  • the client can also receive the data input by the user, perform part of the processing in the foregoing method, and send the data obtained by the processing to the server.
  • the server can receive data from the client, execute the foregoing method or another part of the foregoing method, and return the execution result to the client.
  • the client can receive the execution result of the method from the server, and can present it to the user through an output device, for example.
  • the components of the computing device 2000 may be distributed on a network.
  • one processor may be used to perform some processing, while at the same time another processor remote from the one processor may perform other processing.
  • Other components of the computing device 2000 can be similarly distributed. In this way, the computing device 2000 can be understood as a distributed computing system that performs processing in multiple locations.


Abstract

Provided are a disparity estimation system and method, an electronic device, and a computer-readable storage medium. The disparity estimation system includes: a feature extraction network configured to perform feature extraction on each image in an image pair and output the extracted image features to a disparity generation network; and the disparity generation network, configured to perform cascaded multi-level disparity processing according to the extracted image features to obtain multiple disparity maps with successively increasing sizes, wherein the input of the first level of disparity processing in the multi-level disparity processing includes multiple image features having a size corresponding to that level of disparity processing, and the input of each level of disparity processing other than the first level includes: one or more image features having a size corresponding to that level of disparity processing, and the disparity map generated by the previous level of disparity processing.

Description

Disparity estimation system and method, electronic device, and computer-readable storage medium
Technical Field
The present disclosure relates to the technical field of computer vision, and in particular to a disparity estimation system and method, an electronic device, and a computer-readable storage medium.
Background Art
In the related art, computer vision techniques can be used to obtain the disparity between each pair of matching pixels in two images of the same scene taken from different viewing angles, yielding a disparity map, and depth information of the scene can be obtained based on the disparity map, where the depth information can be used in various fields such as three-dimensional reconstruction, autonomous driving, and obstacle detection. By way of example, methods for obtaining disparity using computer vision techniques may include local area matching methods, global optimization methods, semi-global methods, methods based on convolutional neural networks, and so on.
The methods described in this section are not necessarily methods that have been previously conceived or adopted. Unless otherwise indicated, it should not be assumed that any of the methods described in this section are considered to be prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered to have been recognized in any prior art.
Summary of the Invention
According to one aspect of the present disclosure, a disparity estimation system is provided, including: a feature extraction network configured to perform feature extraction on each image in an image pair and output the extracted image features to a disparity generation network; and the disparity generation network, configured to perform cascaded multi-level disparity processing according to the extracted image features to obtain multiple disparity maps with successively increasing sizes, wherein the input of the first level of disparity processing in the multi-level disparity processing includes multiple image features having a size corresponding to that level of disparity processing, and the input of each level of disparity processing other than the first level includes: one or more image features having a size corresponding to that level of disparity processing, and the disparity map generated by the previous level of disparity processing.
According to another aspect of the present disclosure, a disparity estimation method is provided, including: performing feature extraction on each image in an image pair; and performing cascaded multi-level disparity processing according to the extracted image features to obtain multiple disparity maps with successively increasing sizes, wherein the input of the first level of disparity processing in the multi-level disparity processing includes multiple image features having a size corresponding to that level of disparity processing, and the input of each level of disparity processing other than the first level includes: one or more image features having a size corresponding to that level of disparity processing, and the disparity map generated by the previous level of disparity processing.
According to another aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory storing a program, the program including instructions that, when executed by the processor, cause the processor to perform the method described in the present disclosure.
According to another aspect of the present disclosure, a computer-readable storage medium storing a program is provided, the program including instructions that, when executed by a processor of an electronic device, cause the electronic device to perform the method described in the present disclosure.
Further features and advantages of the present disclosure will become clear from the exemplary embodiments described below with reference to the accompanying drawings.
Brief Description of the Drawings
The drawings exemplarily show embodiments and constitute a part of the specification, and together with the textual description of the specification serve to explain exemplary implementations of the embodiments. The embodiments shown are for illustration only and do not limit the scope of the claims. Throughout the drawings, the same reference numerals refer to similar but not necessarily identical elements.
FIG. 1 is a structural block diagram showing a disparity estimation system according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram showing basic structure features of an image according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram showing semantic features of an image according to an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram showing edge features of an image according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram showing one possible overall structure of a disparity estimation system according to an exemplary embodiment of the present disclosure;
FIG. 6 is a block diagram showing another possible overall structure of a disparity estimation system according to an exemplary embodiment of the present disclosure;
FIG. 7A and FIG. 7B are schematic diagrams respectively showing a reference image and the corresponding ground-truth disparity map on which network training is based, according to an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic diagram showing multiple disparity maps with sizes increasing from right to left, obtained by using the trained disparity estimation system to perform cascaded multi-level disparity processing on the reference image shown in FIG. 7A, according to an exemplary embodiment of the present disclosure;
FIG. 9 is a flowchart showing a disparity estimation method according to an exemplary embodiment of the present disclosure;
FIG. 10 is a flowchart showing the multi-level disparity processing in the disparity estimation method according to an exemplary embodiment of the present disclosure;
FIG. 11 is a structural block diagram showing an exemplary computing device that can be applied to exemplary embodiments of the present disclosure.
Detailed Description
In the present disclosure, unless otherwise stated, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional, temporal, or importance relationship of these elements; such terms are only used to distinguish one element from another. In some examples, the first element and the second element may refer to the same instance of that element, while in some cases, based on the description of the context, they may also refer to different instances.
The terms used in the description of the various examples in the present disclosure are only for the purpose of describing specific examples and are not intended to be limiting. Unless the context clearly indicates otherwise, if the number of an element is not specifically limited, there may be one or more of that element. In addition, the term "and/or" as used in the present disclosure covers any one of the listed items and all possible combinations thereof.
In the related art, computer vision techniques can be used to obtain the disparity between each pair of matching pixels in two images of the same scene taken from different viewing angles, yielding a disparity map, and depth information of the scene can be obtained based on the disparity map, where the depth information can be used in various fields such as three-dimensional reconstruction, autonomous driving, and obstacle detection. By way of example, methods for obtaining disparity using computer vision techniques may include local area matching methods, global optimization methods, semi-global methods, methods based on neural networks such as convolutional neural networks, and so on.
Local area matching methods mainly include steps such as matching cost computation, cost aggregation, disparity computation, and disparity optimization; they are fast and low in energy consumption, but their effectiveness depends on algorithm parameters (such as the size of the matching window), making it difficult to meet the needs of complex scenes. Compared with local area matching methods, global optimization methods have better matching accuracy; they make assumptions about the smoothness term and turn the stereo matching problem of disparity computation into an energy optimization problem. Most global optimization methods skip the cost aggregation step; by considering the matching cost and the smoothness term, they formulate an energy function over all points globally and obtain the disparity by minimizing the energy function. However, compared with local area matching methods, global optimization methods require more computation and consume more energy. Semi-global methods can balance matching accuracy and computation speed to a certain extent; unlike global algorithms, which optimize over all points globally, they split the energy function of each point into paths along multiple directions, evaluate each path, and sum the values of all paths to obtain the energy of that point, where the evaluation of each path can use dynamic programming. Nevertheless, compared with local area matching methods, semi-global methods are also higher in computation and energy consumption. Methods based on neural networks such as CNNs (convolutional neural networks) can obtain a larger receptive field by constructing a disparity network and have better disparity prediction capability in textureless regions of the image. However, their computation depends on the parameters of the neural network and the image size: the more complex the network parameters and the larger the image, the greater the memory consumption and the lower the running speed.
The present disclosure provides a new disparity estimation system, which can perform cascaded multi-level disparity processing based on the extracted image features of each image in an image pair to obtain multiple disparity maps with successively increasing sizes, wherein the input of the first level of disparity processing in the multi-level disparity processing may include multiple image features having a size corresponding to that level of disparity processing, and the input of each level of disparity processing other than the first level may include: one or more image features having a size corresponding to that level of disparity processing, and the disparity map generated by the previous level of disparity processing. In other words, by performing cascaded multi-level disparity processing on the extracted image features, where the input of each level may include image features having a size corresponding to that level, multiple disparity maps of different sizes can be obtained in one pass for use by multiple target devices with different performance or different accuracy requirements, thereby meeting the accuracy and speed requirements of different target devices and also improving the flexibility and applicability of the disparity estimation system. Exemplary embodiments of the disparity estimation system of the present disclosure will be further described below with reference to the accompanying drawings.
FIG. 1 is a structural block diagram showing a disparity estimation system according to an exemplary embodiment of the present disclosure. As shown in FIG. 1, the disparity estimation system 100 may include, for example: a feature extraction network 200 configured to perform feature extraction on each image in an image pair and output the extracted image features to a disparity generation network 300; and the disparity generation network 300, configured to perform cascaded multi-level disparity processing according to the extracted image features to obtain multiple disparity maps with successively increasing sizes, wherein the input of the first level of disparity processing in the multi-level disparity processing includes multiple image features having a size corresponding to that level of disparity processing, and the input of each level of disparity processing other than the first level includes: one or more image features having a size corresponding to that level of disparity processing, and the disparity map generated by the previous level of disparity processing.
According to the disparity estimation system shown in the block diagram of FIG. 1, cascaded multi-level disparity processing can be performed based on the extracted image features of each image in the image pair to obtain multiple disparity maps with successively increasing sizes, where the input of each level of disparity processing may include image features having a size corresponding to that level. Thus, multiple disparity maps of different sizes can be obtained in one pass for use by multiple target devices with different performance or accuracy requirements, thereby meeting the accuracy and speed requirements of different target devices and improving the flexibility and applicability of the disparity estimation system.
In the present disclosure, the image pair may be an image pair for the same scene captured by a multi-camera device. The images in the image pair have the same size but different corresponding viewing angles. Of course, the image pair may also be a qualifying image pair obtained in other ways (for example, obtained from another third-party device). In addition, each image in the image pair may be a grayscale image or a color image.
In the present disclosure, a multi-camera device (multi-eye camera) refers to a camera configured with two, three, or even more lenses that is capable of static or dynamic image capture; it can cover scenes from different viewing angles or ranges through the multiple configured lenses, so as to enhance its ability to detect objects in the scene. Taking a binocular camera configured with two lenses (for example, left and right lenses) as an example, for any scene, the binocular camera can use its two lenses to obtain two images of that scene with the same size but different capture angles (for example, a left-eye image and a right-eye image), where the image pair formed by the two images can be used to determine the displacement (for example, horizontal displacement), i.e., the disparity, of objects in the scene between corresponding pixels in the two images, so as to determine depth information such as the distance of the object.
In addition, in the present disclosure, the disparity estimation system 100 and the multi-camera device may be independent of each other. In other words, the disparity estimation system 100 can use the included feature extraction network 200 to perform feature extraction on each image of the image pair captured by the multi-camera device for the same scene, and use the included disparity generation network 300 to perform cascaded multi-level disparity processing on the extracted image features to obtain multiple disparity maps with successively increasing sizes. Alternatively, the multi-camera device may also be part of the disparity estimation system 100. In other words, in addition to the feature extraction network 200 and the disparity generation network 300, the disparity estimation system 100 may also include the multi-camera device.
According to some embodiments, the image features of each image in the image pair extracted by the feature extraction network 200 of the disparity estimation system 100 may include at least one or more of the following features: basic structure features, semantic features, edge features, texture features, color features, object shape features, or features based on the image itself.
FIG. 2 uses three images (a), (b), and (c) (for example grayscale images, or possibly color images) to schematically show basic structure features of an image that may be extracted according to an exemplary embodiment of the present disclosure. As can be seen from FIG. 2, the basic structure features may refer to features that reflect various fine structures of the image.
FIG. 3 uses four images (a), (b), (c), and (d) (for example grayscale images, or possibly color images) to schematically show semantic features of an image that may be extracted according to an exemplary embodiment of the present disclosure. As can be seen from FIG. 3, semantic features may refer to features capable of distinguishing different objects or different categories of objects in an image. In addition, based on the semantic features, the accuracy of disparity determination in ambiguous regions of the image (for example, large flat regions) can be improved.
FIG. 4 uses two images (a) and (b) (for example grayscale images, or possibly color images) to schematically show edge features of an image that may be extracted according to an exemplary embodiment of the present disclosure. As can be seen from FIG. 4, edge features may refer to features that reflect the boundary information of objects or regions in an image.
In addition, although not shown, the texture features, color features, and object shape features may respectively refer to features that can reflect the texture and color of an image and the shapes of the objects contained in the image. The features based on the image itself may refer to the image itself, or an image obtained by up-sampling or down-sampling the image itself by a certain coefficient or ratio. The coefficient or ratio of the up-sampling or down-sampling may be, for example, 2, 3, or another value greater than 1.
According to some embodiments, in the present disclosure, except for the features based on the image itself, each of the other kinds of image features can be obtained by extracting the corresponding image with a corresponding feature extraction sub-network, so as to improve the efficiency of image feature extraction and hence of disparity estimation. In addition, to improve the accuracy of disparity estimation, features can be extracted from the image in at least three different dimensions: basic structure features, semantic features, and edge features.
For example, according to some embodiments, as shown in FIG. 5 or FIG. 6 (which show possible overall structural block diagrams of the disparity estimation system 100 according to exemplary embodiments of the present disclosure), the feature extraction network 200 may include multiple feature extraction sub-networks each used to extract different features of the image, and the multiple feature extraction sub-networks may include at least a basic structure feature sub-network 201 for extracting the basic structure features of the image, a semantic feature sub-network 202 for extracting the semantic features of the image, and an edge feature sub-network 203 for extracting the edge features of the image.
According to some embodiments, the basic structure feature sub-network 201 may adopt any network that can be used to extract the basic structure features of an image, such as VGG (very deep convolutional networks for large-scale image recognition) or ResNet (Residual Network). The semantic feature sub-network 202 may adopt any network that can be used to extract the semantic features of an image, such as DeepLabV3+ (encoder-decoder with atrous separable convolution for semantic image segmentation). The edge feature sub-network 203 may adopt any network that can be used to extract the edge features of an image, such as an HED (holistically-nested edge detection) network. According to some implementations, the HED network may use VGG as its backbone network, and, when the edge feature sub-network 203 uses an HED network, the basic structure feature sub-network 201 may share the same VGG network with the edge feature sub-network 203, so as to simplify the structure of the feature extraction network.
According to some embodiments, the feature extraction network 200 or each feature extraction sub-network included in it may be an extraction network pre-trained based on a training sample set, which can improve the efficiency of image feature extraction and hence of disparity estimation. Of course, according to actual needs, the feature extraction network 200 or each of its feature extraction sub-networks may also be obtained by real-time training based on a training sample set, or by real-time or scheduled optimization of a pre-trained extraction network based on an updated training sample set, so as to improve the accuracy of the extracted features.
According to some embodiments, the training process of the feature extraction network 200 or of each of its feature extraction sub-networks may adopt supervised training or unsupervised training, which can be flexibly selected according to actual needs. Supervised training usually uses existing training samples (for example, labeled data) to learn a mapping from input to output, and then applies this mapping to unknown data for classification or regression. Supervised training algorithms may include, for example, logistic regression, SVM (Support Vector Machine), decision trees, and so on. Unsupervised training differs from supervised training in that it does not require training samples but directly models unlabeled data to find patterns therein; typical algorithms include, for example, clustering algorithms, Random Forests, and so on.
According to some embodiments, as described above, the input of the first level of disparity processing in the multi-level disparity processing may include multiple image features with a size corresponding to that level; the input of each level of disparity processing other than the first level may include one or more image features with a size corresponding to that level. Taking the case where the multiple disparity maps obtained by the disparity generation network 300 are N disparity maps with successively increasing sizes as an example, the image features extracted by the feature extraction network 200 may include image features of N sizes, where N is a positive integer not less than 2. At least a part of the image features of each size can be used to help generate a disparity map of the corresponding size, so as to improve the accuracy of the multiple disparity maps obtained by the disparity estimation system. According to some embodiments, the value of N may be 4 (as shown in FIG. 5 or FIG. 6); of course, it may also be set to 2, 3, 5, etc. according to actual needs. In addition, a larger N is not necessarily better; rather, an appropriate value can be selected to balance the accuracy requirements of the target device and the speed of the disparity estimation system.
In the present disclosure, the size of each image (including each image in the image pair as well as each disparity map, etc.) may refer to the single-channel size of the image, which can be represented by the image's height and width, for example as H×W, where H is the height and W the width of the image, both in pixels. Of course, this is only an example; the size of an image may also be represented by one or more other parameters that reflect the number of pixels, data volume, storage volume, or resolution of the image. In addition, it should be noted that the number of channels of a grayscale image is 1, while a color image may have three color channels R, G, and B, so its actual size may be expressed as H×W×3. Also, in the present disclosure, the size of each image in the image pair (i.e., the size of the original image without down-sampling and/or up-sampling) can be determined according to parameters such as the sensor size and pixel count of the multi-camera device used to capture the image pair.
In the present disclosure, the size corresponding to each level of disparity processing may refer to a size consistent with the size of the disparity map to be obtained by that level of disparity processing. In addition, the size of an image feature may refer to the single-channel size of the picture formed by the image feature itself, or the size of the extracted-from image on which the extraction of image features of the required size is based, where the extracted-from image may refer to each image in the image pair itself, or an image obtained by up-sampling or down-sampling each image itself by a certain coefficient or ratio. For example, taking the size of the images in the image pair as H×W (which may be called full size), the full-size image features extracted for such an image may be the image features obtained by performing feature extraction on the image itself, and the (H/2)×(W/2)-size (which may be called 1/2-size) image features extracted for the image may be the image features obtained by down-sampling the image by a factor of 2 to obtain a 1/2-size image and performing feature extraction on that 1/2-size image.
According to some embodiments, in addition to one or more image features with a size corresponding to that level, the input of each level of disparity processing other than the first level may include the disparity map generated by the previous level of disparity processing. In other words, the disparity map generated by the first level of disparity processing can be optimized level by level based on one or more correspondingly sized image features among the extracted image features, to obtain disparity maps of the corresponding sizes. Thereby, the accuracy of the successively obtained disparity maps can improve step by step without having to compute a disparity map from scratch for each accuracy level, which improves the overall efficiency of generating the multiple disparity maps.
According to some embodiments, the smallest-size image features among the N sizes of image features extracted by the feature extraction network 200 may include, for example, at least one image feature of the first image and at least one image feature of the second image in the image pair, and each non-smallest size of image features among the N sizes may include, for example, at least one image feature of the first image and/or at least one image feature of the second image in the image pair.
For example, as shown in FIG. 5 or FIG. 6, the smallest-size image features among the N sizes extracted by the feature extraction network 200 may include the basic structure features, semantic features, and edge features of the first image (for example, the left-eye image) in the image pair and the basic structure features of the second image (for example, the right-eye image) in the image pair. Each non-smallest-size image feature among the N sizes extracted by the feature extraction network 200 may include an edge feature of the first image in the image pair or a feature based on the first image itself.
According to some embodiments, each kind of image feature of each image extracted by the feature extraction network 200 may have one or more sizes, and the number of the multiple sizes may be less than or equal to N. For example, as shown in FIG. 5 or FIG. 6, the value of N may be 4; the extracted edge features of the first image and the features based on the first image itself may each have two sizes, the extracted basic structure features and semantic features of the first image may each have one size, and the extracted basic structure features of the second image may have one size. In addition, FIG. 5 and FIG. 6 are only examples; each kind of image feature of each image extracted by the feature extraction network 200 may have more sizes than the one or two sizes shown. For example, with N being 4, the extracted edge features of the first image may also have three or four sizes, without limitation.
According to some embodiments, after extracting the image features of each image in the image pair, the feature extraction network 200 may store (for example, cache) them in a storage device or storage medium for subsequent reading and use. In addition, before performing image feature extraction on each image in the image pair, the feature extraction network 200 may also perform epipolar rectification on the images in the image pair, so that the images in the image pair have disparity in only one direction (for example, the horizontal or vertical direction). Thereby, the disparity search range of the images can be limited to one direction, improving the efficiency of subsequent feature extraction and disparity generation. As an alternative, the epipolar rectification of the images in the image pair may also be performed by the multi-camera device or another third-party device. For example, after capturing the image pair, the multi-camera device may perform epipolar rectification on the images and send the rectified image pair to the disparity estimation system; or, after capturing the image pair, the multi-camera device may send it to another third-party device, which performs epipolar rectification on the images and sends the rectified image pair to the disparity estimation system.
According to some embodiments, the largest-size disparity map among the multiple disparity maps obtained by the disparity generation network 300 may be consistent with the size of each image in the image pair (i.e., the original image size). Thus, through the cascaded multi-level disparity processing, at least one relatively high-accuracy disparity map whose size is consistent with that of each image in the image pair, as well as disparity maps with other accuracies, can be obtained; on the basis of improving the flexibility and applicability of the disparity estimation system, this can better meet the accuracy requirements that high-performance target devices place on the disparity maps generated by the disparity estimation system. As an alternative, the sizes of all of the multiple disparity maps may also be smaller than the size of each image in the image pair.
According to some embodiments, the height and width of the latter disparity map in any two adjacent disparity maps among the multiple disparity maps may be respectively twice the height and width of the former. For example, taking the case where there are 4 disparity maps and the size of the last one is H×W (which may be consistent with the size of each image in the image pair), the sizes of the other disparity maps before it may be, in order, (H/2)×(W/2) (if H×W is called full size, this may be called 1/2 size), (H/4)×(W/4) (which may be called 1/4 size), and (H/8)×(W/8) (which may be called 1/8 size). In other words, in the present disclosure, the value 2 is used as the scaling step of the height and width of adjacent disparity maps (i.e., the up-sampling or down-sampling coefficient or ratio between adjacent disparity maps). As an alternative, the height and width of the latter of any two adjacent disparity maps may also be respectively 3 times, 4 times, or another positive-integer multiple greater than 1 of those of the former, and an appropriate value can be selected according to the actually required accuracy.
According to some embodiments, as described above, the image features extracted by the feature extraction network 200 may include image features of N sizes, N being a positive integer not less than 2. Correspondingly, the disparity generation network may be configured to: in the first level of the multi-level disparity processing, generate an initial disparity map of the smallest size according to at least a part of the smallest-size image features among the N sizes; and in each subsequent level of the multi-level disparity processing, perform disparity optimization processing on the disparity map generated by the previous level according to at least a part of the correspondingly sized image features among the N sizes, generating an optimized disparity map of the corresponding size, where the multiple disparity maps may include at least each optimized disparity map.
According to some embodiments, the multi-level disparity processing may include N+1 levels of disparity processing. The disparity generation network may be configured to, in the N levels of disparity processing other than the first level, in order of size from small to large, perform disparity optimization processing on the disparity map generated by the previous level based on at least a part of the correspondingly sized image features among the N sizes, obtaining N optimized disparity maps with successively increasing sizes, and use the N optimized disparity maps as the multiple disparity maps, where the sizes of the N optimized disparity maps respectively correspond to the N sizes.
For example, as shown in FIG. 5, the multi-level disparity processing may include 4+1 levels of disparity processing. The extracted image features may include image features of 4 sizes; FIG. 5 illustrates the case where the 4 sizes are 1/8 size ((H/8)×(W/8)), 1/4 size ((H/4)×(W/4)), 1/2 size ((H/2)×(W/2)), and full size (H×W, where full size may refer to a size consistent with the size of the original images in the image pair). The disparity generation network may be configured to, in the first level of the multi-level disparity processing, generate an initial disparity map of the smallest size (i.e., 1/8 size) according to at least a part of the smallest-size (i.e., 1/8-size) image features among the 4 sizes (for example, part or all of the extracted 1/8-size basic structure features, 1/8-size semantic features, and 1/8-size edge features of the first image and the 1/8-size basic structure features of the second image). In the 4 levels of disparity processing other than the first level, in order of size from small to large, disparity optimization processing is performed on the disparity map generated by the previous level based successively on at least a part of the correspondingly sized image features among the 4 sizes (for example, successively based on part or all of the extracted 1/8-size edge features of the first image, part or all of the extracted 1/4-size edge features of the first image, the extracted 1/2-size features based on the first image itself, and the extracted full-size features based on the first image itself), obtaining 4 optimized disparity maps with successively increasing sizes (for example, a 1/8-size optimized disparity map, a 1/4-size optimized disparity map, a 1/2-size optimized disparity map, and a full-size optimized disparity map), and the 4 optimized disparity maps are used as the multiple disparity maps.
From the above description, in this embodiment the multiple disparity maps obtained by the disparity estimation system 100 may exclude the initial disparity map generated by the first level of the multi-level disparity processing and instead include the optimized disparity maps obtained by successively optimizing that initial disparity map, which can improve the accuracy of the multiple disparity maps obtained by the disparity estimation system.
According to other embodiments, the multi-level disparity processing may include N levels of disparity processing. The disparity generation network may be configured to, in the N-1 levels of disparity processing other than the first level, in order of size from small to large, perform disparity optimization processing on the disparity map generated by the previous level based on at least a part of the correspondingly sized image features among the N-1 non-smallest sizes of the N sizes, obtaining N-1 optimized disparity maps with successively increasing sizes, and use the initial disparity map and the N-1 optimized disparity maps as the multiple disparity maps, where the sizes of the initial disparity map and of the N-1 optimized disparity maps respectively correspond to the N sizes.
For example, as shown in FIG. 6, the multi-level disparity processing may include 4 levels of disparity processing. The extracted image features may include image features of 4 sizes; FIG. 6 illustrates the case where the 4 sizes are 1/8 size ((H/8)×(W/8)), 1/4 size ((H/4)×(W/4)), 1/2 size ((H/2)×(W/2)), and full size (H×W, where full size may refer to a size consistent with the size of each image in the image pair). The disparity generation network may be configured to, in the first level of the multi-level disparity processing, generate an initial disparity map of the smallest size (i.e., 1/8 size) according to at least a part of the smallest-size (i.e., 1/8-size) image features among the 4 sizes (for example, part or all of the extracted 1/8-size basic structure features, 1/8-size semantic features, and 1/8-size edge features of the first image and the 1/8-size basic structure features of the second image). In the other 3 levels of disparity processing, in order of size from small to large, disparity optimization processing is performed on the disparity map generated by the previous level based successively on at least a part of the correspondingly sized image features among the other 3 non-smallest sizes (for example, successively based on part or all of the extracted 1/4-size edge features of the first image, the extracted 1/2-size features based on the first image itself, and the extracted full-size features based on the first image itself), obtaining 3 optimized disparity maps with successively increasing sizes (for example, a 1/4-size optimized disparity map, a 1/2-size optimized disparity map, and a full-size optimized disparity map), and the initial disparity map and the 3 optimized disparity maps are used as the multiple disparity maps.
From the above description, in this embodiment the multiple disparity maps obtained by the disparity estimation system 100 may also include the initial disparity map generated by the first level of the multi-level disparity processing, so as to improve the processing efficiency of the disparity estimation system.
According to some embodiments, the disparity generation network 300 may be configured to, in each level of the multi-level disparity processing other than the first level, perform residual computation processing on the disparity map generated by the previous level based on at least a part of the correspondingly sized image features, obtaining a residual map of the corresponding size, and combine the residual map of the corresponding size with the disparity map generated by the previous level to obtain an optimized disparity map of the corresponding size.
For example, as shown in FIG. 5, in the 1/8-size disparity processing among the 4 levels other than the first level, a 1/8-size first residual map can be computed based on part or all of the extracted 1/8-size edge features of the first image and the initial disparity map (1/8 size) generated by the previous level, and the first residual map can be combined with (for example, added to) the initial disparity map to obtain a 1/8-size first optimized disparity map as the output of that level. In the next level, corresponding to 1/4 size, a 1/4-size second residual map can be computed based on part or all of the extracted 1/4-size edge features of the first image and the first optimized disparity map (1/8 size) generated by the previous level, and the second residual map can be combined with the first optimized disparity map (for example, by adding the second residual map to a 1/4-size up-sampled version of the first optimized disparity map) to obtain a 1/4-size second optimized disparity map as the output of that level, and so on. More specific examples will be discussed later.
According to some embodiments, the disparity generation network 300 may also be configured to, in each level of the multi-level disparity processing other than the first level, before performing disparity optimization processing on the disparity map generated by the previous level, in response to the size of that disparity map being smaller than the size corresponding to the current level, up-sample the disparity map generated by the previous level to the size corresponding to the current level. The up-sampling algorithm may include, for example, nearest-neighbor interpolation, bilinear interpolation, deconvolution, and so on. In this way, the disparity map targeted by each level of disparity optimization processing can be a disparity map having the size corresponding to that level.
For example, as shown in FIG. 5, in the 1/8-size disparity processing among the 4 levels other than the first level, since the size of the initial disparity map generated by the previous level is not smaller than the size of this level, disparity optimization processing can be performed directly on the initial disparity map based on part or all of the extracted 1/8-size edge features of the first image, obtaining the 1/8-size first optimized disparity map. In the next level, corresponding to 1/4 size, the 1/8-size first optimized disparity map generated by the previous level can first be up-sampled to the 1/4 size corresponding to this level, after which disparity optimization processing can be performed on the up-sampled 1/4-size first optimized disparity map based on part or all of the extracted 1/4-size edge features of the first image, obtaining the 1/4-size second optimized disparity map. For example, as described above, in this level a 1/4-size residual map can be computed based on part or all of the extracted 1/4-size edge features of the first image and the up-sampled 1/4-size first optimized disparity map, and the 1/4-size residual map can be added to the up-sampled 1/4-size first optimized disparity map to obtain the 1/4-size second optimized disparity map, and so on.
According to some embodiments, the image features on which different optimized disparity maps are based may be the same kind or different kinds of image features; and/or, they may be image features of the same image or of different images in the image pair. For example, as shown in FIG. 5, the image features on which the first two optimized disparity maps are based may be the same kind of image feature (for example, edge features) of the same image (for example, the first image) in the image pair; the image features on which the middle two optimized disparity maps are based may be different kinds of image features of the same image (for example, one being the edge features of the first image and the other being features based on the first image itself), and so on. Thus, by flexibly selecting the image features on which each level of disparity optimization is based, the flexibility and applicability of the disparity estimation system can be further improved.
According to some embodiments, the image features on which each optimized disparity map is based may include, for example, the edge features of at least one image in the image pair and/or features based on the image itself of at least one image in the image pair.
For example, as shown in FIG. 5, in the first two levels of disparity optimization processing (other than the first level of disparity processing), which correspond to relatively small sizes, optimized disparity maps of the corresponding sizes can be generated based on the edge features of the first image; in the last two levels of disparity optimization processing, which correspond to relatively large sizes, features based on the first image itself can be used instead of edge features to perform disparity optimization processing on the disparity map generated by the previous level, so as to reduce the computation required for large-size image feature extraction and improve the processing efficiency of the disparity estimation system. Of course, FIG. 5 is only an example; for each disparity optimization processing, the image features on which the corresponding optimized disparity map is based may be edge features or features based on the image itself, a combination of the two, or a combination of one or more other extracted image features, and so on.
According to some embodiments, the features based on the image itself of at least one image in the image pair may include, for example, the image itself of the at least one image, or an image obtained by down-sampling the image itself according to the size of the optimized disparity map to be generated. The down-sampling process may be, for example: for an image of size H×W, if the down-sampling coefficient or ratio is K, one point can be selected every K points in each row and each column of the original image to form an image. The down-sampling coefficient or ratio may, as described above, be 2, 3, or another value greater than 1. Of course, this is only an example; down-sampling may also be implemented in other ways, such as averaging over the K points.
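The every-K-th-point down-sampling described above can be written in one line; this NumPy sketch is purely illustrative:

```python
import numpy as np

def downsample_every_k(img: np.ndarray, k: int) -> np.ndarray:
    """Keeps one point every k rows and columns, so an H x W image
    becomes roughly an (H/k) x (W/k) image."""
    return img[::k, ::k]
```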
As shown in FIG. 5, in the last two levels of disparity optimization processing, corresponding to 1/2 size and full size, the 1/2-size first image obtained by down-sampling the first image itself with a down-sampling coefficient of 2, and the first image itself, can be used respectively in place of correspondingly sized edge features to perform disparity optimization processing on the disparity map generated by the previous level, so as to reduce the computation required for large-size image feature extraction and improve the processing efficiency of the disparity estimation system.
According to some embodiments, as shown in FIG. 5 or FIG. 6, the disparity generation network 300 may include an initial disparity generation sub-network 301 and at least one disparity optimization sub-network 302, where the initial disparity generation sub-network 301 and each disparity optimization sub-network 302 of the at least one disparity optimization sub-network 302 are cascaded in sequence; the initial disparity generation sub-network 301 is configured to perform the first level of disparity processing, and the at least one disparity optimization sub-network 302 is configured to perform the levels of disparity processing other than the first level.
In combination with the foregoing embodiments, the image feature extraction of the disparity estimation system 100 and each level of the multi-level disparity processing can each be implemented by a corresponding sub-network.
By way of example, taking FIG. 5 as an example, the overall working process of a disparity estimation system 100 including multiple feature extraction sub-networks (for example, a basic structure feature sub-network 201, a semantic feature sub-network 202, and an edge feature sub-network 203), an initial disparity generation sub-network 301, and multiple (for example, four) disparity optimization sub-networks 302 is schematically described below.
As can be seen from FIG. 5, for the first image I1 and the second image I2 (each of size H×W) of the image pair input to the disparity estimation system 100, the disparity estimation system 100 can use the multiple feature extraction sub-networks in its feature extraction network 200 to extract image features of the sizes required for the subsequent multi-level disparity processing. For example, the basic structure feature sub-network 201 can extract the 1/8-size ((H/8)×(W/8)) basic structure features of the first image I1 and the 1/8-size basic structure features of the second image I2; the semantic feature sub-network 202 can extract the 1/8-size semantic features of the first image I1; and the edge feature sub-network 203 can extract the 1/8-size edge features and the 1/4-size ((H/4)×(W/4)) edge features of the first image I1. In addition to the above image features, the feature extraction network 200 of the disparity estimation system 100 can also extract the 1/2-size ((H/2)×(W/2)) features based on the first image I1 itself, and the full-size (H×W) features based on the first image I1 itself, i.e., the first image I1 itself.
The 1/8-size basic structure features of the first image I1, the 1/8-size basic structure features of the second image I2, the 1/8-size semantic features of the first image I1, and the 1/8-size edge features of the first image I1 can be output by the corresponding feature extraction sub-networks to the initial disparity generation sub-network 301 for the first level of disparity processing, obtaining an initial disparity map dispS1 of 1/8 size. Thereafter, the four disparity optimization sub-networks 302, cascaded in sequence with the initial disparity generation sub-network 301, can each perform successive levels of disparity optimization processing on the 1/8-size initial disparity map dispS1 based on the correspondingly sized image features extracted by the feature extraction network 200, so as to obtain multiple optimized disparity maps with successively increasing sizes.
For example, the first disparity optimization sub-network can perform disparity optimization processing on the 1/8-size initial disparity map dispS1 output by the initial disparity generation sub-network 301 according to the 1/8-size edge features (part or all) of the first image I1 from the edge feature sub-network 203, obtaining the 1/8-size first optimized disparity map dispS1_refine. According to some embodiments, the first disparity optimization sub-network can compute a 1/8-size first residual map based on the 1/8-size edge features (part or all) of the first image I1 and the 1/8-size initial disparity map dispS1, and add the 1/8-size first residual map to the 1/8-size initial disparity map dispS1 to obtain the 1/8-size first optimized disparity map dispS1_refine.
The second disparity optimization sub-network can perform disparity optimization processing on the 1/8-size first optimized disparity map dispS1_refine output by the first disparity optimization sub-network according to the 1/4-size edge features (part or all) of the first image I1 from the edge feature sub-network 203, obtaining the 1/4-size second optimized disparity map dispS2_refine. According to some embodiments, the second disparity optimization sub-network can up-sample the 1/8-size first optimized disparity map output by the first disparity optimization sub-network to the 1/4 size corresponding to this level, then compute a 1/4-size second residual map based on the 1/4-size edge features (part or all) of the first image I1 and the up-sampled 1/4-size first optimized disparity map, and add the 1/4-size second residual map to the up-sampled 1/4-size first optimized disparity map to obtain the 1/4-size second optimized disparity map dispS2_refine.
The third disparity optimization sub-network can perform disparity optimization processing on the 1/4-size second optimized disparity map dispS2_refine output by the second disparity optimization sub-network according to the 1/2-size features based on the first image I1 itself (part or all) extracted by the feature extraction network 200, obtaining the 1/2-size third optimized disparity map dispS3_refine. According to some embodiments, the third disparity optimization sub-network can up-sample the 1/4-size second optimized disparity map output by the second disparity optimization sub-network to the 1/2 size corresponding to this level, then compute a 1/2-size third residual map based on the 1/2-size features based on the first image itself (part or all) and the up-sampled 1/2-size second optimized disparity map, and add the 1/2-size third residual map to the up-sampled 1/2-size second optimized disparity map to obtain the 1/2-size third optimized disparity map dispS3_refine.
The fourth disparity optimization sub-network can perform disparity optimization processing on the 1/2-size third optimized disparity map dispS3_refine output by the third disparity optimization sub-network according to the full-size features based on the first image I1 itself (part or all) extracted by the feature extraction network 200, obtaining the full-size fourth optimized disparity map dispS4_refine. According to some embodiments, the fourth disparity optimization sub-network can up-sample the 1/2-size third optimized disparity map output by the third disparity optimization sub-network to the full size corresponding to this level, then compute a full-size fourth residual map based on the full-size features based on the first image itself (part or all) and the up-sampled full-size third optimized disparity map, and add the full-size fourth residual map to the up-sampled full-size third optimized disparity map to obtain the full-size fourth optimized disparity map dispS4_refine. It should be noted that, in this example, the third and fourth disparity optimization sub-networks use features based on the first image itself to perform disparity optimization processing in order to reduce the amount of computation; however, one or both of them can also use edge features or other features of the first image. Similarly, if the amount of computation needs to be further reduced, the first and/or second disparity optimization sub-networks can also use features based on the first image itself in place of the extracted edge features, and so on.
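To summarize the FIG. 5 workflow just described, the following pseudo-wiring sketch shows how the initial sub-network and the four optimization sub-networks could be chained; all names (the features dictionary keys, init_net, refine_nets) are hypothetical, and the doubling of the disparity values on each up-sampling is a common convention rather than something stated in the text:

```python
import torch.nn.functional as F

def cascaded_disparity(features, init_net, refine_nets):
    """Chains the first-level disparity generation with four levels of
    residual refinement, assuming each level doubles the resolution."""
    disp = init_net(features["s1"])  # 1/8-size initial disparity map (dispS1)
    outputs = []
    guides = [features["edge_s1"], features["edge_s2"],
              features["img_s3"], features["img_full"]]
    for net, guide in zip(refine_nets, guides):
        if disp.shape[-1] != guide.shape[-1]:  # match this level's resolution
            disp = F.interpolate(disp, size=guide.shape[-2:], mode="bilinear",
                                 align_corners=True) * 2
        disp = net(disp, guide)  # residual refinement at this level
        outputs.append(disp)
    return outputs  # dispS1_refine ... dispS4_refine, sizes increasing
```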
The 1/8-size first optimized disparity map dispS1_refine, the 1/4-size second optimized disparity map dispS2_refine, the 1/2-size third optimized disparity map dispS3_refine, and the full-size fourth optimized disparity map dispS4_refine can then serve as the multiple disparity maps with successively increasing sizes obtained by the disparity estimation system 100 shown in FIG. 5.
In addition, the overall working process of the disparity estimation system 100 shown in FIG. 6 is similar to that of the disparity estimation system 100 shown in FIG. 5, the only differences being that the size of the initial disparity map generated by the initial disparity generation sub-network 301 is smaller than the size of the optimized disparity map generated by the first disparity optimization sub-network, and that the initial disparity map generated by the initial disparity generation sub-network 301 can be used as one of the multiple disparity maps with successively increasing sizes obtained by the disparity estimation system 100; details are therefore not repeated.
According to some embodiments, each of the initial disparity generation sub-network 301 and the at least one disparity optimization sub-network 302 may be any convolutional neural network capable of realizing the corresponding disparity processing function, such as a 2DCNN (two-dimensional deep convolutional neural network) or a 3DCNN (three-dimensional deep convolutional neural network). Using a convolutional neural network as the disparity processing sub-network can obtain a larger receptive field, thereby improving the accuracy of the disparity maps obtained by the disparity estimation system.
According to some embodiments, if the initial disparity generation sub-network 301 uses a 2DCNN structure to obtain disparity, it may include a first number (for example, 5; the value can be flexibly selected according to actual needs) of sequentially cascaded convolution layers. The convolution method of each convolutional layer may adopt, for example, depthwise separable convolution.
By way of example, Table 1 is used below to schematically describe an initial disparity generation sub-network 301 of a 2DCNN structure that can be applied to the disparity estimation system shown in FIG. 5 and includes 5 sequentially cascaded convolutional layers (conv1-conv5 in Table 1). As an example, this sub-network uses the MobileNetV2 network architecture.
Table 1: Description of a possible 2DCNN network structure of the initial disparity generation sub-network 301
[The body of Table 1 is rendered as images in the original publication and is not reproduced here.]
From Table 1 and FIG. 5, the corr1d layer can be used to perform correlation operations on the 1/8-size basic structure features of the first image and the 1/8-size basic structure features of the second image extracted by the feature extraction network 200 in FIG. 5. The semanS1_conv layer can be used to perform convolution processing on the extracted 1/8-size semantic features of the first image based on a 3×3 convolution kernel. The edgeS1_conv layer can be used to perform convolution processing on the extracted 1/8-size edge features of the first image based on a 3×3 convolution kernel. The concat layer can be used to merge the features output by corr1d, semanS1_conv, and edgeS1_conv.
In addition, the MB_conv operation in layers conv1-conv5 refers to the depthwise separable convolution operation in MobileNetV2, and the MB_conv_res operation refers to the residual depthwise separable convolution operation in MobileNetV2. In other words, the conv1, conv2, and conv4 layers can each be used to perform depthwise separable convolution operations on the features output by the previous layer, and the conv3 and conv5 layers can each be used to perform residual depthwise separable convolution operations on the features output by the previous layer. The dispS1 layer can be used to perform a soft argmin calculation on the features output by the previous layer to obtain the initial disparity map dispS1 of the corresponding size (i.e., 1/8 size).
Note that H and W in Table 1 can respectively represent the height and width of the images in the image pair input to the disparity estimation system 100, and D can represent the maximum disparity range of the images; the units of all three can be pixels, where the value of D may be related to the focal length of each camera in the multi-camera device used to capture the image pair and/or the spacing between the cameras. In addition, the number of convolutional layers of the initial disparity generation sub-network 301 with the 2DCNN structure may be determined according to the number of features obtained by the concat layer. For example, if the number of features obtained by the concat layer is large, the number of convolutional layers included in the initial disparity generation sub-network 301 can be increased accordingly.
作为替换方案,所述初始视差生成子网络301还可以采用3DCNN结构获取视差,采用3DCNN结构的所述初始视差生成子网络301可包括第二数量个(例如7个,当然还可根据实际需求灵活选取相应数值)依次级联的卷积层。
示例的,以下将通过表2对一种可应用于图5所示的视差估计***的、包括7个依次级联的卷积层(例如表2中的conv1-conv7)的3DCNN结构的初始视差生成子网络301进行示意说明。
表2:初始视差生成子网络301的一种可能的3DCNN网络结构的相关描述
Figure PCTCN2020121824-appb-000017
Figure PCTCN2020121824-appb-000018
结合表2和图5可知,edgeS1_conv层可用于基于3×3的卷积核对提取到的第一图像的1/8尺寸的边缘特征进行卷积处理。semanS3_conv层可用于基于3×3的卷积核对提取到的第一图像的1/8尺寸的语义特征进行卷积处理。concat层可用于对featS1、semanS1_conv以及edgeS1_conv输出的特征进行合并处理,其中,虽然表2中未示出,featS1可以是指提取到的第一图像的1/8尺寸的基础结构特征以及第二图像的1/8尺寸的基础结构特征。
此外,cost层可用于对concat层输出的特征进行平移操作。conv1层至conv7层可分别用于基于3×3×3的卷积核对上一层输出的特征进行卷积操作,其中,conv2层、conv4层以及conv6层可相当于是3DCNN网络的残差模块,还可分别用于对上一层输出的特征进行卷积操作之后,将卷积结果与上一层输出的结果进行相加处理。dispS1层可用于对上一层输出的特征进行soft argmin计算,得到对应尺寸(即1/8尺寸)的初始视差图dispS1。
Similar to Table 1, H and W mentioned in Table 2 may respectively denote the height and width of the images in the image pair input to the disparity estimation system 100. In addition, F may denote the number of feature channels, with 1F denoting F channels, 3F denoting 3×F channels, and so on. Furthermore, the number of convolution layers of the 3DCNN-structured initial disparity generation subnetwork 301 may also be determined according to the number of features obtained by the concat layer. For example, if the concat layer obtains a relatively large number of features, the number of convolution layers included in the initial disparity generation subnetwork 301 may be increased.
According to some embodiments, the number of convolution layers included in each disparity refinement subnetwork 302 of the at least one disparity refinement subnetwork 302 may be smaller than the number of convolution layers included in the initial disparity generation subnetwork 301. For example, taking each disparity refinement subnetwork 302 adopting a 2DCNN structure as an example, the number of convolution layers included in each disparity refinement subnetwork 302 may be three, although it may of course be set to another value according to actual needs. In addition, referring to the descriptions of the foregoing embodiments, each disparity refinement subnetwork 302 may also adopt a 3DCNN structure, without limitation thereto.
Tables 3 to 6 below illustrate the structures of the multiple disparity refinement subnetworks 302 applicable to the disparity estimation system shown in FIG. 5, where Tables 3 to 6 respectively describe possible 2DCNN network structures of the first through fourth disparity refinement subnetworks of the disparity estimation system shown in FIG. 5.
Table 3: Description of one possible 2DCNN network structure of the first disparity refinement subnetwork
[The layer-by-layer contents of Table 3 were rendered as images in the original publication and are not reproduced here.]
Similar to Table 1 above, the edgeS1_conv layer may be used to convolve the extracted 1/8-size edge features of the first image with a 3×3 convolution kernel. The concat layer may be used to merge the 1/8-size initial disparity map dispS1 generated by the previous level of disparity processing (i.e., the initial disparity generation processing) with the features output by edgeS1_conv.
In addition, the conv1 and conv3 layers may each be used to apply a depthwise separable convolution to the features output by the preceding layer, and the conv2 layer may be used to apply a residual depthwise separable convolution to the features output by the preceding layer. The dispS1_refine layer may be used to add the features output by the preceding conv3 layer to the 1/8-size initial disparity map dispS1 generated by the previous level of disparity processing, obtaining the first refined disparity map dispS1_refine of the corresponding size (i.e., 1/8 size).
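A minimal sketch of this first refinement subnetwork is given below, assuming PyTorch; plain 3×3 convolutions stand in for the depthwise-separable MB_conv blocks of Table 3, and all channel widths and activations are assumptions:

```python
import torch
import torch.nn as nn

class RefineS1(nn.Module):
    def __init__(self, edge_ch, hidden=32):
        super().__init__()
        self.edge_conv = nn.Conv2d(edge_ch, hidden, 3, padding=1)  # edgeS1_conv
        self.conv1 = nn.Conv2d(hidden + 1, hidden, 3, padding=1)   # conv1
        self.conv2 = nn.Conv2d(hidden, hidden, 3, padding=1)       # conv2 (residual)
        self.conv3 = nn.Conv2d(hidden, 1, 3, padding=1)            # conv3 -> residual map
        self.act = nn.ReLU(inplace=True)

    def forward(self, disp_s1, edge_feat):
        # disp_s1: (B, 1, H/8, W/8); edge_feat: (B, edge_ch, H/8, W/8)
        x = torch.cat([disp_s1, self.act(self.edge_conv(edge_feat))], dim=1)
        x = self.act(self.conv1(x))
        x = x + self.act(self.conv2(x))   # residual block
        res = self.conv3(x)               # 1/8-size residual map
        return disp_s1 + res              # dispS1_refine
```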
Table 4: Description of one possible 2DCNN network structure of the second disparity refinement subnetwork
[The layer-by-layer contents of Table 4 were rendered as images in the original publication and are not reproduced here.]
As can be seen from Table 4 and FIG. 5, the dispS1_up layer may be used to upsample the 1/8-size first refined disparity map dispS1_refine generated by the previous level of disparity processing (i.e., the first level of disparity refinement processing) to obtain a 1/4-size refined disparity map dispS1_up. The edgeS2_conv layer may be used to convolve the extracted 1/4-size edge features of the first image with a 3×3 convolution kernel. The concat layer may be used to merge the upsampled 1/4-size refined disparity map dispS1_up with the features output by edgeS2_conv.
In addition, the conv1 and conv3 layers may each be used to apply a depthwise separable convolution to the features output by the preceding layer, and the conv2 layer may be used to apply a residual depthwise separable convolution to the features output by the preceding layer. The dispS2_refine layer may be used to add the features output by the preceding conv3 layer to the upsampled 1/4-size refined disparity map dispS1_up, obtaining the second refined disparity map dispS2_refine of the corresponding size (i.e., 1/4 size).
Table 5: Description of one possible 2DCNN network structure of the third disparity refinement subnetwork
[The layer-by-layer contents of Table 5 were rendered as images in the original publication and are not reproduced here.]
As can be seen from Table 5 and FIG. 5, the dispS2_up layer may be used to upsample the 1/4-size second refined disparity map dispS2_refine generated by the previous level of disparity processing (i.e., the second level of disparity refinement processing) to obtain a 1/2-size refined disparity map dispS2_up. The imgS3 layer may be used to downsample the first image itself to obtain 1/2-size features based on the first image itself, where I1 in Table 5 denotes the first image. The concat layer may be used to merge the upsampled 1/2-size refined disparity map dispS2_up with the features output by imgS3.
In addition, the conv1, conv2, and conv3 layers may each be used to convolve the features output by the preceding layer, and the dispS3_refine layer may be used to add the features output by the preceding conv3 layer to the upsampled 1/2-size refined disparity map dispS2_up, obtaining the third refined disparity map dispS3_refine of the corresponding size (i.e., 1/2 size).
Table 6: Description of one possible 2DCNN network structure of the fourth disparity refinement subnetwork
[The layer-by-layer contents of Table 6 were rendered as images in the original publication and are not reproduced here.]
As can be seen from Table 6 and FIG. 5, the dispS3_up layer may be used to upsample the 1/2-size third refined disparity map dispS3_refine generated by the previous level of disparity processing (i.e., the third level of disparity refinement processing) to obtain a full-size refined disparity map dispS3_up. The concat layer may be used to merge the upsampled full-size refined disparity map dispS3_up with the first image itself, where I1 in Table 6 denotes the first image.
In addition, the conv1, conv2, and conv3 layers may each be used to convolve the features output by the preceding layer, and the dispS4_refine layer may be used to add the features output by the preceding conv3 layer to the upsampled full-size refined disparity map dispS3_up, obtaining the fourth refined disparity map dispS4_refine of the corresponding size (i.e., full size).
It should be noted that, similar to the foregoing embodiments, H and W mentioned in Tables 3 to 6 may respectively denote the height and width of the images in the image pair input to the disparity estimation system 100. In addition, the number of convolution layers of each 2DCNN-structured disparity refinement subnetwork 302 may also be determined according to the number of features obtained by the concat layer. For example, if the concat layer obtains a relatively large number of features, the number of convolution layers included in each disparity refinement subnetwork 302 may be increased.
According to some embodiments, each of the initial disparity generation subnetwork 301 and the at least one disparity refinement subnetwork 302 may be a network pre-trained on a training sample set, which can improve the efficiency of disparity processing. Of course, depending on actual needs, each of the initial disparity generation subnetwork 301 and the at least one disparity refinement subnetwork 302 may instead be trained in real time on a training sample set, or obtained by refining a pre-trained network in real time or at regular intervals based on an updated training sample set, so as to improve the accuracy of disparity generation.
According to some embodiments, the training of each of the initial disparity generation subnetwork 301 and the at least one disparity refinement subnetwork 302 may adopt either supervised training or unsupervised training, chosen flexibly according to actual needs. Introductions to supervised and unsupervised training can be found in the relevant descriptions of the foregoing embodiments and are not repeated here.
According to some embodiments, each of the initial disparity generation subnetwork 301 and the at least one disparity refinement subnetwork 302 may further be configured to compute a loss function, where the loss function may represent the error between the disparities in the disparity map generated by that subnetwork and the corresponding ground-truth disparities. In this way, by computing the loss function, the accuracy of each disparity map generated by the disparity estimation system can be quantified. Moreover, the corresponding system may also be optimized based on the loss function.
According to some embodiments, taking supervised training of each of the initial disparity generation subnetwork 301 and the at least one disparity refinement subnetwork 302 as an example, the loss function output by each disparity processing subnetwork or each level of disparity processing may be defined as L_n = f(Disp_GTn − Disp_Sn) + g(Disp_Sn), where n ranges from 1 to N (for the disparity estimation system shown in FIG. 5) or from 0 to N (for the disparity estimation system shown in FIG. 6). The function f represents the difference between the predicted disparity (Disp_Sn) and the ground-truth disparity (Disp_GTn); its explicit formulas were rendered as images in the original publication and are not reproduced here. The function g represents a disparity continuity constraint, for example g(x) = |x_x| + |x_y|, where x_x and x_y denote the horizontal and vertical gradients of x. In addition, edge features may also be considered as a regularization term of the loss function, without limitation thereto. Accordingly, the final loss function of the disparity estimation system 100 may be the sum of the loss functions output by the respective disparity processing subnetworks or levels of disparity processing.
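For illustration, a per-level supervised loss of this form may be sketched as follows, assuming PyTorch, and assuming a smooth-L1 penalty for f (since the publication's explicit f is not reproduced here); the gradient terms implement g(x) = |x_x| + |x_y|:

```python
import torch
import torch.nn.functional as F

def level_loss(pred_disp, gt_disp):
    # pred_disp, gt_disp: (B, 1, H, W) disparity maps at the same scale
    f_term = F.smooth_l1_loss(pred_disp, gt_disp)  # assumed form of f
    # g: disparity-continuity (smoothness) penalty on the prediction
    dx = (pred_disp[:, :, :, 1:] - pred_disp[:, :, :, :-1]).abs().mean()
    dy = (pred_disp[:, :, 1:, :] - pred_disp[:, :, :-1, :]).abs().mean()
    return f_term + dx + dy

# The system's total loss may then be the sum of level_loss over all levels.
```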
According to other embodiments, if each of the initial disparity generation subnetwork 301 and the at least one disparity refinement subnetwork 302 adopts unsupervised training, the loss function of each disparity processing subnetwork or each level of disparity processing may be obtained by reconstructing an image and computing the reconstruction error. For example, taking the loss function of one level of disparity processing, the loss may be expressed as a reconstruction error between the first image I1 and the reconstructed image warpI1 (the explicit formula was rendered as an image in the original publication and is not reproduced here), where warpI1 = warp(I2, Disp1), and the warp function reconstructs the second image I2 into an image aligned with I1 according to the disparity computed by that level's disparity processing subnetwork.
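For illustration, the warp-based reconstruction and a resulting photometric error may be sketched as follows, assuming PyTorch's grid_sample sampling conventions; the L1 photometric penalty is an assumption, since the publication's exact loss formula is not reproduced here:

```python
import torch
import torch.nn.functional as F

def warp(img2, disp1):
    # img2: (B, C, H, W) second image; disp1: (B, 1, H, W) first-view disparity.
    # Reconstructs the first view by sampling img2 at x - d(x).
    B, _, H, W = img2.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xs = xs.to(img2).expand(B, H, W) - disp1[:, 0]   # shift by disparity
    ys = ys.to(img2).expand(B, H, W)
    grid = torch.stack([2 * xs / (W - 1) - 1,        # normalize to [-1, 1]
                        2 * ys / (H - 1) - 1], dim=-1)
    return F.grid_sample(img2, grid, align_corners=True)

def unsup_loss(img1, img2, disp1):
    return (img1 - warp(img2, disp1)).abs().mean()   # photometric L1 (assumed)
```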
In the following, taking as an example that each of the initial disparity generation subnetwork 301 and the at least one disparity refinement subnetwork 302 adopts supervised training, that the training set is Scene Flow, and that the disparity estimation system is structured as shown in FIG. 5, the reference image on which the training is based, the corresponding ground-truth disparity map, and the results obtained by applying the trained parameters to images of the Middlebury dataset are illustrated with reference to FIG. 7A, FIG. 7B, and FIG. 8.
FIG. 7A and FIG. 7B respectively show schematic diagrams of a reference image on which network training is based and the corresponding ground-truth disparity map according to an exemplary embodiment of the present disclosure, and FIG. 8 shows a schematic diagram of a plurality of disparity maps with sizes increasing from right to left, obtained by applying cascaded multi-level disparity processing to the reference image shown in FIG. 7A using the trained disparity estimation system according to an exemplary embodiment of the present disclosure (i.e., the results obtained by applying the trained parameters to images of the Middlebury dataset). As can be seen from these figures, the sizes of the obtained disparity maps increase successively, their accuracy increases successively, and the accuracy of the largest disparity map approaches that of the ground-truth disparity map. Moreover, although FIG. 7A, FIG. 7B, and FIG. 8 illustrate the reference image, the ground-truth disparity map, and the generated disparity maps as grayscale images, it can be understood that when the reference image shown in FIG. 7A is a color image, the disparity maps shown in FIG. 7B and FIG. 8 may also be corresponding color images.
According to some embodiments, the disparity generation network 300 may further be configured to select, according to the performance of a target device, a disparity map whose size matches the performance of the target device from the plurality of disparity maps as the disparity map provided to the target device. For example, if the performance of the target device is high and/or the accuracy it requires of the disparity map is high, a disparity map of larger size may be selected from the plurality of disparity maps and provided to the target device. Alternatively, the target device may, according to its own performance, actively obtain the disparity map it requires from the plurality of disparity maps obtained by the disparity estimation system, without limitation thereto.
Furthermore, although not shown, the plurality of disparity maps obtained by the disparity estimation system may also be provided to a corresponding target device for further processing, for example so that the target device can compute a depth map based on the disparity map and thereby obtain depth information of the scene, for application in various scenarios such as three-dimensional reconstruction, autonomous driving, and obstacle detection.
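For illustration, a target device might convert a disparity map into a depth map using the standard stereo relation depth = focal length × baseline / disparity; a minimal sketch, assuming a calibrated focal length in pixels and a baseline in meters:

```python
import numpy as np

def disparity_to_depth(disp, focal_px, baseline_m, eps=1e-6):
    # disp: disparity map in pixels -> depth map in meters;
    # eps guards against division by zero at invalid pixels.
    return focal_px * baseline_m / np.maximum(disp, eps)
```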
Exemplary disparity estimation systems according to the present disclosure have been described above with reference to FIGS. 1 to 8. Exemplary embodiments of an exemplary disparity estimation method and an exemplary electronic device of the present disclosure will be further described below with reference to FIG. 9, FIG. 10, and FIG. 11. It should be noted that the various definitions, embodiments, implementations, examples, and the like described above with reference to FIGS. 1 to 8 may also apply to, or be combined with, the exemplary embodiments described below.
According to some embodiments, FIG. 9 shows a flowchart of a disparity estimation method according to an exemplary embodiment of the present disclosure. As shown in FIG. 9, the disparity estimation method of the present disclosure may include the following steps: performing feature extraction on each image in an image pair (step S901); and performing cascaded multi-level disparity processing according to the extracted image features to obtain a plurality of disparity maps with successively increasing sizes (step S902), where the input of the first level of disparity processing in the multi-level disparity processing includes a plurality of image features having a size corresponding to that level of disparity processing, and the input of each level of disparity processing other than the first level includes one or more image features having a size corresponding to that level of disparity processing, together with the disparity map generated by the previous level of disparity processing.
According to some embodiments, the image pair may be an image pair of the same scene captured by a multiocular camera. The images in the image pair have the same size but correspond to different viewpoints. In addition, each image in the image pair may be a grayscale image or a color image.
According to some embodiments, the extracted image features of each image in the image pair may include at least one or more of the following: basic structure features, semantic features, edge features, texture features, color features, object shape features, or features based on the image itself. For example, the extracted image features of the first image (e.g., the left-view image) in the image pair may include basic structure features, semantic features, and edge features, while the extracted image features of the second image (e.g., the right-view image) in the image pair may include basic structure features. Alternatively, the extracted image features of both the first image and the second image in the image pair may each include basic structure features, semantic features, and edge features, and so on.
According to some embodiments, the disparity map with the largest size among the plurality of disparity maps may be consistent with the size of the images in the image pair; of course, the sizes of all of the plurality of disparity maps may instead be smaller than the size of the images in the image pair. In addition, the height and width of the latter of any two adjacent disparity maps among the plurality of disparity maps may respectively be twice the height and width of the former; of course, depending on the accuracy actually required, they may instead be set to three times, four times, or another positive integer multiple greater than 1 of the height and width of the former. As an example, taking four disparity maps whose last disparity map has a size of H×W (which may be consistent with the size of the images in the image pair), the sizes of the other disparity maps preceding it may in turn be (H/2)×(W/2) (if H×W is referred to as the full size, (H/2)×(W/2) may be referred to as the 1/2 size), (H/4)×(W/4) (which may be referred to as the 1/4 size), and (H/8)×(W/8) (which may be referred to as the 1/8 size).
According to some embodiments, the extracted image features may include image features of N sizes, where N is a positive integer not smaller than 2. Accordingly, as shown in FIG. 10 (which shows a flowchart of the multi-level disparity processing in the disparity estimation method according to an exemplary embodiment of the present disclosure), performing cascaded multi-level disparity processing according to the extracted image features to obtain a plurality of disparity maps with successively increasing sizes may include the following steps.
Step S1001: in the first level of disparity processing of the multi-level disparity processing, generating an initial disparity map having the minimum size according to at least a part of the image features of the minimum size among the image features of the N sizes.
As an example, taking the extracted image features of the N sizes as including image features of four sizes, namely 1/8 size, 1/4 size, 1/2 size, and full size, in the first level of disparity processing of the multi-level disparity processing, an initial disparity map having the minimum size (i.e., 1/8 size) may be generated according to at least a part of the image features of the minimum size (i.e., 1/8 size) among the image features of the four sizes.
In addition, as shown in Tables 1 and 2 above, the corresponding image features of the corresponding size may be disparity-shifted and stacked, with a 3DCNN used to obtain the initial disparity map; alternatively, the differences of the shifted image features of the corresponding size may be computed, with a 2DCNN used to obtain the initial disparity map.
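For illustration, the 2DCNN alternative based on differences of shifted features may be sketched as follows, assuming PyTorch; the channel layout of the output tensor (disparities stacked along the channel axis) is a hypothetical choice:

```python
import torch

def shifted_difference_volume(feat_left, feat_right, max_disp):
    # feat_*: (B, C, H, W) -> (B, max_disp * C, H, W) difference volume
    # that a 2DCNN can consume directly.
    B, C, H, W = feat_left.shape
    diffs = []
    for d in range(max_disp):
        diff = torch.zeros_like(feat_left)
        diff[:, :, :, d:] = feat_left[:, :, :, d:] - feat_right[:, :, :, :W - d]
        diffs.append(diff)
    return torch.cat(diffs, dim=1)
```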
Step S1002: in each subsequent level of disparity processing of the multi-level disparity processing, performing disparity refinement processing on the disparity map generated by the previous level of disparity processing according to at least a part of the image features of the corresponding size among the image features of the N sizes, to generate a refined disparity map having the corresponding size, where the plurality of disparity maps include at least the respective refined disparity maps.
According to some embodiments, the multi-level disparity processing may include N+1 levels of disparity processing. Accordingly, in each subsequent level of disparity processing of the multi-level disparity processing, performing disparity refinement processing on the disparity map generated by the previous level of disparity processing according to at least a part of the image features of the corresponding size among the image features of the N sizes, to generate a refined disparity map having the corresponding size, where the plurality of disparity maps include at least the respective refined disparity maps, may include:
in the N levels of disparity processing other than the first level, performing disparity refinement processing on the disparity map generated by the previous level, in order of size from small to large, based in turn on at least a part of the image features of the corresponding size among the image features of the N sizes, to obtain N refined disparity maps with successively increasing sizes, and taking the N refined disparity maps as the plurality of disparity maps, where the sizes of the N refined disparity maps respectively correspond to the N sizes.
As an example, taking the extracted image features of the N sizes as including image features of four sizes, namely 1/8 size, 1/4 size, 1/2 size, and full size, and the multi-level disparity processing as including 4+1 levels of disparity processing, in the four levels of disparity processing other than the first level, disparity refinement processing may be performed on the disparity map generated by the previous level, in order of size from small to large, based in turn on at least a part of the image features of the corresponding size among the image features of the four sizes, to obtain four refined disparity maps with successively increasing sizes (for example, a refined disparity map of 1/8 size, a refined disparity map of 1/4 size, a refined disparity map of 1/2 size, and a refined disparity map of full size), and the four refined disparity maps may be taken as the plurality of disparity maps.
According to other embodiments, the multi-level disparity processing may include N levels of disparity processing. Accordingly, in each subsequent level of disparity processing of the multi-level disparity processing, performing disparity refinement processing on the disparity map generated by the previous level of disparity processing according to at least a part of the image features of the corresponding size among the image features of the N sizes, to generate a refined disparity map having the corresponding size, where the plurality of disparity maps include at least the respective refined disparity maps, may include:
in the N-1 levels of disparity processing other than the first level, performing disparity refinement processing on the disparity map generated by the previous level, in order of size from small to large, based in turn on at least a part of the image features of the corresponding size among the N-1 non-minimum sizes of the image features of the N sizes, to obtain N-1 refined disparity maps with successively increasing sizes, and taking the initial disparity map and the N-1 refined disparity maps as the plurality of disparity maps, where the sizes of the initial disparity map and the N-1 refined disparity maps respectively correspond to the N sizes.
As an example, taking the extracted image features of the N sizes as including image features of four sizes, namely 1/8 size, 1/4 size, 1/2 size, and full size, and the multi-level disparity processing as including four levels of disparity processing, in the other three levels of disparity processing besides the first level, disparity refinement processing may be performed on the disparity map generated by the previous level, in order of size from small to large, based in turn on at least a part of the image features of the corresponding size among the other three non-minimum sizes, to obtain three refined disparity maps with successively increasing sizes (for example, a refined disparity map of 1/4 size, a refined disparity map of 1/2 size, and a refined disparity map of full size), and the initial disparity map together with the three refined disparity maps may be taken as the plurality of disparity maps.
Thus, the plurality of disparity maps obtained may either include or exclude the initial disparity map generated by the first level of the multi-level disparity processing, which improves the flexibility of disparity generation.
According to some embodiments, in each subsequent level of disparity processing of the multi-level disparity processing, performing disparity refinement processing on the disparity map generated by the previous level of disparity processing according to at least a part of the image features of the corresponding size among the image features of the N sizes, to generate a refined disparity map having the corresponding size, may include:
in each level of disparity processing other than the first level, performing residual computation processing on the disparity map generated by the previous level based on at least a part of the image features of the corresponding size, to obtain a residual map having the corresponding size, and combining the residual map of the corresponding size with the disparity map generated by the previous level to obtain a refined disparity map having the corresponding size.
For example, taking the extracted image features of the N sizes as including image features of four sizes, namely 1/8 size, 1/4 size, 1/2 size, and full size, and the multi-level disparity processing as including 4+1 levels of disparity processing, in the 1/8-size level among the four levels of disparity processing other than the first level (i.e., the 1/8-size disparity refinement processing), a 1/8-size first residual map may be computed based on some or all of the extracted 1/8-size image features together with the initial disparity map generated by the previous level, and a 1/8-size first refined disparity map may be computed based on the first residual map and the initial disparity map. In the next level, corresponding to the 1/4-size disparity refinement processing, a 1/4-size second residual map may be computed based on some or all of the extracted 1/4-size image features together with the first refined disparity map generated by the previous level, and a 1/4-size second refined disparity map may be computed based on the second residual map and the first refined disparity map, and so on.
According to some embodiments, in each level of disparity processing other than the first level of the multi-level disparity processing, before performing disparity refinement processing on the disparity map generated by the previous level, the method may further include: in response to the size of the disparity map generated by the previous level being smaller than the size corresponding to the current level, upsampling the disparity map generated by the previous level to the size corresponding to the current level.
For example, still taking the extracted image features of the N sizes as including image features of four sizes, namely 1/8 size, 1/4 size, 1/2 size, and full size, and the multi-level disparity processing as including 4+1 levels, in the 1/4-size level among the four levels of disparity processing other than the first level (i.e., the 1/4-size disparity refinement processing), the 1/8-size first refined disparity map generated by the previous level may be upsampled to the 1/4 size corresponding to the current level, after which disparity refinement processing may be performed on the upsampled 1/4-size first refined disparity map based on some or all of the extracted 1/4-size image features, to obtain the 1/4-size second refined disparity map.
According to some embodiments, the image features of the minimum size among the image features of the N sizes may include, for example, at least one kind of image feature of the first image and at least one kind of image feature of the second image in the image pair. For example, the image features of the minimum size among the image features of the N sizes may include the basic structure features, semantic features, and edge features of the first image (e.g., the left-view image) in the image pair, together with the basic structure features of the second image (e.g., the right-view image) in the image pair.
Each of the non-minimum sizes of the image features of the N sizes may include, for example, at least one kind of image feature of the first image and/or at least one kind of image feature of the second image in the image pair. For example, the image features of each non-minimum size among the image features of the N sizes may include the edge features of the first image in the image pair or features based on the first image itself.
In addition, referring to the foregoing system embodiments, the image features on which the generation of different refined disparity maps is based may be image features of the same kind or of different kinds; and/or the image features on which the generation of different refined disparity maps is based may be image features of the same image or of different images in the image pair.
The image features on which the generation of each refined disparity map is based may include, for example, edge features of at least one image in the image pair, and/or features based on the image itself of at least one image in the image pair. The features based on the image itself of at least one image in the image pair may include, for example, the image itself of the at least one image, or an image obtained by downsampling the image itself of the at least one image according to the size of the refined disparity map to be generated.
According to some embodiments, the disparity estimation method may further include: computing a loss function for each level of disparity processing of the multi-level disparity processing, where the loss function may represent the error between the disparities in the disparity map generated by that level of disparity processing and the corresponding ground-truth disparities. In this way, by computing the loss function, the accuracy of each generated disparity map can be quantified, and the disparity estimation method can also be optimized based on the loss function.
According to some embodiments, the disparity estimation method may further include: selecting, according to the performance of a target device, a disparity map whose size matches the performance of the target device from the plurality of disparity maps as the disparity map provided to the target device. For example, if the performance of the target device is high and/or the accuracy it requires of the disparity map is high, a disparity map of larger size may be selected from the plurality of disparity maps and provided to the target device. Alternatively, the target device may, according to its own performance, actively obtain the disparity map it requires from the plurality of disparity maps obtained by the disparity estimation system.
Furthermore, the disparity estimation method may further include: before performing image feature extraction on the images in the image pair, performing epipolar rectification on the images in the image pair so that the images in the image pair have disparity in only one direction (e.g., the horizontal direction). As a result, the disparity search range of the images can be limited to a single direction, improving the efficiency of subsequent feature extraction and disparity generation.
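For illustration, epipolar rectification may be performed with OpenCV as sketched below, assuming calibrated intrinsics (K1, K2), distortion coefficients (d1, d2), and stereo extrinsics (R, T) are available; after remapping, corresponding points in the two images differ only in the horizontal direction:

```python
import cv2

def rectify_pair(img1, img2, K1, d1, K2, d2, R, T):
    size = (img1.shape[1], img1.shape[0])  # (width, height)
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, size, R, T)
    m1 = cv2.initUndistortRectifyMap(K1, d1, R1, P1, size, cv2.CV_32FC1)
    m2 = cv2.initUndistortRectifyMap(K2, d2, R2, P2, size, cv2.CV_32FC1)
    return (cv2.remap(img1, m1[0], m1[1], cv2.INTER_LINEAR),
            cv2.remap(img2, m2[0], m2[1], cv2.INTER_LINEAR))
```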
An aspect of the present disclosure may include an electronic device, which may include a processor and a memory storing a program, the program including instructions that, when executed by the processor, cause the processor to perform any of the foregoing methods.
An aspect of the present disclosure may include a computer-readable storage medium storing a program, the program including instructions that, when executed by a processor of an electronic device, cause the electronic device to perform any of the foregoing methods.
Referring to FIG. 11, a computing device 2000, which is an example of a hardware device applicable to aspects of the present disclosure, will now be described. The computing device 2000 may be any machine configured to perform processing and/or computation, and may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a smartphone, an in-vehicle computer, or any combination thereof. The electronic device described above may be implemented, in whole or at least in part, by the computing device 2000 or a similar device or system.
The computing device 2000 may include elements connected to, or in communication with, a bus 2002 (possibly via one or more interfaces). For example, the computing device 2000 may include the bus 2002, one or more processors 2004, one or more input devices 2006, and one or more output devices 2008. The one or more processors 2004 may be any type of processor and may include, but are not limited to, one or more general-purpose processors and/or one or more special-purpose processors (e.g., special processing chips). The input device 2006 may be any type of device capable of inputting information into the computing device 2000, and may include, but is not limited to, a mouse, a keyboard, a touchscreen, a microphone, and/or a remote control. The output device 2008 may be any type of device capable of presenting information, and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer. The computing device 2000 may also include, or be connected to, a storage device 2010, which may be any non-transitory storage device enabling data storage, and may include, but is not limited to, a disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, an optical disc or any other optical medium, a ROM (read-only memory), a RAM (random access memory), a cache memory and/or any other memory chip or cartridge, and/or any other medium from which a computer can read data, instructions, and/or code. The storage device 2010 may be detachable from an interface. The storage device 2010 may hold data/programs (including instructions)/code for implementing the methods and steps described above. The computing device 2000 may also include a communication device 2012. The communication device 2012 may be any type of device or system enabling communication with external devices and/or with a network, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The computing device 2000 may also include a working memory 2014, which may be any type of working memory capable of storing programs (including instructions) and/or data useful to the operation of the processor 2004, and may include, but is not limited to, random access memory and/or read-only memory devices.
Software elements (programs) may reside in the working memory 2014, including, but not limited to, an operating system 2016, one or more applications 2018, drivers, and/or other data and code. Instructions for performing the methods and steps described above may be included in the one or more applications 2018, and the feature extraction network 200 and the disparity generation network 300 of the disparity estimation system 100 described above may be implemented by the processor 2004 reading and executing the instructions of the one or more applications 2018. More specifically, the feature extraction network 200 of the disparity estimation system 100 may, for example, be implemented by the processor 2004 executing an application 2018 having instructions for performing step S901. Furthermore, the disparity generation network 300 of the disparity estimation system 100 may, for example, be implemented by the processor 2004 executing an application 2018 having instructions for performing step S902, and so on. Executable code or source code of the instructions of the software elements (programs) may be stored in a non-transitory computer-readable storage medium (such as the storage device 2010 described above) and, when executed, may be loaded into the working memory 2014 (possibly after compilation and/or installation). Executable code or source code of the instructions of the software elements (programs) may also be downloaded from a remote location.
It should also be understood that various modifications may be made according to specific requirements. For example, custom hardware may be used, and/or particular elements may be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, some or all of the disclosed methods and devices may be implemented by programming hardware (e.g., programmable logic circuits including field-programmable gate arrays (FPGAs) and/or programmable logic arrays (PLAs)) in an assembly language or a hardware programming language (such as VERILOG, VHDL, or C++) using logic and algorithms according to the present disclosure.
It should also be understood that the foregoing methods may be implemented in a server-client mode. For example, a client may receive data input by a user and send the data to a server. The client may also receive data input by the user, perform part of the processing of the foregoing methods, and send the resulting data to the server. The server may receive the data from the client, perform the foregoing methods or another part of the foregoing methods, and return the execution results to the client. The client may receive the execution results of the method from the server and may, for example, present them to the user through an output device.
It should also be understood that the components of the computing device 2000 may be distributed over a network. For example, some processing may be performed using one processor while other processing is performed by another processor remote from that processor. Other components of the computing device 2000 may be similarly distributed. As such, the computing device 2000 may be construed as a distributed computing system that performs processing at multiple locations.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it should be understood that the methods, systems, and devices described above are merely exemplary embodiments or examples, and the scope of the present invention is not limited by these embodiments or examples but is defined only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced by equivalent elements. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, the various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements appearing after the present disclosure.

Claims (36)

  1. A disparity estimation system, comprising:
    a feature extraction network configured to perform feature extraction on each image in an image pair and output the extracted image features to a disparity generation network; and
    the disparity generation network, configured to perform cascaded multi-level disparity processing according to the extracted image features to obtain a plurality of disparity maps with successively increasing sizes,
    wherein an input of a first level of disparity processing in the multi-level disparity processing comprises a plurality of image features having a size corresponding to that level of disparity processing, and an input of each level of disparity processing other than the first level of disparity processing comprises: one or more image features having a size corresponding to that level of disparity processing, and a disparity map generated by a previous level of disparity processing.
  2. The disparity estimation system of claim 1, wherein the extracted image features of each image in the image pair comprise at least one or more of the following features:
    basic structure features, semantic features, edge features, texture features, color features, object shape features, or features based on the image itself.
  3. The disparity estimation system of claim 2, wherein the feature extraction network comprises a plurality of feature extraction subnetworks each for extracting different features of an image, the plurality of feature extraction subnetworks comprising at least a basic structure feature subnetwork, a semantic feature subnetwork, and an edge feature subnetwork.
  4. The disparity estimation system of claim 1, wherein a disparity map with a largest size among the plurality of disparity maps is consistent with the size of each image in the image pair.
  5. The disparity estimation system of claim 1, wherein
    the extracted image features comprise image features of N sizes, N being a positive integer not smaller than 2;
    the disparity generation network is configured to, in the first level of disparity processing of the multi-level disparity processing, generate an initial disparity map having a minimum size according to at least a part of image features of the minimum size among the image features of the N sizes; and
    in each subsequent level of disparity processing of the multi-level disparity processing, perform disparity refinement processing on the disparity map generated by the previous level of disparity processing according to at least a part of image features of a corresponding size among the image features of the N sizes, to generate a refined disparity map having the corresponding size;
    wherein the plurality of disparity maps comprise at least the respective refined disparity maps.
  6. The disparity estimation system of claim 5, wherein
    the multi-level disparity processing comprises N+1 levels of disparity processing;
    the disparity generation network is configured to, in the N levels of disparity processing other than the first level of disparity processing, perform disparity refinement processing on the disparity map generated by the previous level of disparity processing, in order of size from small to large, based in turn on at least a part of image features of the corresponding size among the image features of the N sizes, to obtain N refined disparity maps with successively increasing sizes, and take the N refined disparity maps as the plurality of disparity maps, wherein the sizes of the N refined disparity maps respectively correspond to the N sizes.
  7. The disparity estimation system of claim 5, wherein
    the multi-level disparity processing comprises N levels of disparity processing;
    the disparity generation network is configured to, in the N-1 levels of disparity processing other than the first level of disparity processing, perform disparity refinement processing on the disparity map generated by the previous level of disparity processing, in order of size from small to large, based in turn on at least a part of image features of the corresponding size among the N-1 non-minimum sizes of the image features of the N sizes, to obtain N-1 refined disparity maps with successively increasing sizes, and take the initial disparity map and the N-1 refined disparity maps as the plurality of disparity maps, wherein the sizes of the initial disparity map and the N-1 refined disparity maps respectively correspond to the N sizes.
  8. The disparity estimation system of any one of claims 5-7, wherein
    the disparity generation network is configured to, in each level of disparity processing other than the first level of disparity processing in the multi-level disparity processing, perform residual computation processing on the disparity map generated by the previous level of disparity processing based on at least a part of image features of the corresponding size, to obtain a residual map having the corresponding size, and combine the residual map of the corresponding size with the disparity map generated by the previous level of disparity processing to obtain a refined disparity map having the corresponding size.
  9. The disparity estimation system of any one of claims 5-7, wherein
    the disparity generation network is configured to, in each level of disparity processing other than the first level of disparity processing in the multi-level disparity processing, before performing disparity refinement processing on the disparity map generated by the previous level of disparity processing, in response to the size of the disparity map generated by the previous level of disparity processing being smaller than the size corresponding to the current level of disparity processing, upsample the disparity map generated by the previous level of disparity processing to the size corresponding to the current level of disparity processing.
  10. The disparity estimation system of any one of claims 5-7, wherein the image features of the minimum size among the image features of the N sizes comprise at least one kind of image feature of a first image and at least one kind of image feature of a second image in the image pair, and the image features of each non-minimum size among the image features of the N sizes comprise at least one kind of image feature of the first image and/or at least one kind of image feature of the second image in the image pair.
  11. The disparity estimation system of any one of claims 5-7, wherein the image features on which generation of different refined disparity maps is based are image features of a same kind or of different kinds; and/or
    the image features on which generation of different refined disparity maps is based are image features of a same image or of different images in the image pair.
  12. The disparity estimation system of any one of claims 5-7, wherein the image features on which generation of each refined disparity map is based comprise edge features of at least one image in the image pair, and/or features based on the image itself of at least one image in the image pair.
  13. The disparity estimation system of claim 12, wherein the features based on the image itself of at least one image in the image pair comprise the image itself of the at least one image, or an image obtained by downsampling the image itself of the at least one image according to the size of the refined disparity map to be generated.
  14. The disparity estimation system of claim 1, wherein the disparity generation network comprises an initial disparity generation subnetwork and at least one disparity refinement subnetwork, the initial disparity generation subnetwork and each disparity refinement subnetwork of the at least one disparity refinement subnetwork being cascaded in sequence, the initial disparity generation subnetwork being configured to perform the first level of disparity processing, and the at least one disparity refinement subnetwork being configured to perform the levels of disparity processing other than the first level of disparity processing.
  15. The disparity estimation system of claim 14, wherein
    each of the initial disparity generation subnetwork and the at least one disparity refinement subnetwork is a two-dimensional deep convolutional neural network 2DCNN or a three-dimensional deep convolutional neural network 3DCNN.
  16. The disparity estimation system of claim 15, wherein the number of convolution layers included in each disparity refinement subnetwork of the at least one disparity refinement subnetwork is smaller than the number of convolution layers included in the initial disparity generation subnetwork.
  17. The disparity estimation system of claim 14, wherein each of the initial disparity generation subnetwork and the at least one disparity refinement subnetwork is configured to compute a loss function, the loss function representing an error between disparities in the disparity map generated by the subnetwork and corresponding ground-truth disparities.
  18. The disparity estimation system of claim 1, wherein
    the disparity generation network is configured to select, according to performance of a target device, a disparity map whose size matches the performance of the target device from the plurality of disparity maps as the disparity map provided to the target device.
  19. The disparity estimation system of claim 1, wherein the image pair is an image pair of a same scene captured by a multiocular camera.
  20. A disparity estimation method, comprising:
    performing feature extraction on each image in an image pair; and
    performing cascaded multi-level disparity processing according to the extracted image features to obtain a plurality of disparity maps with successively increasing sizes,
    wherein an input of a first level of disparity processing in the multi-level disparity processing comprises a plurality of image features having a size corresponding to that level of disparity processing, and an input of each level of disparity processing other than the first level of disparity processing comprises: one or more image features having a size corresponding to that level of disparity processing, and a disparity map generated by a previous level of disparity processing.
  21. The disparity estimation method of claim 20, wherein the extracted image features of each image in the image pair comprise at least one or more of the following features:
    basic structure features, semantic features, edge features, texture features, color features, object shape features, or features based on the image itself.
  22. The disparity estimation method of claim 20, wherein a disparity map with a largest size among the plurality of disparity maps is consistent with the size of each image in the image pair.
  23. The disparity estimation method of claim 20, wherein
    the extracted image features comprise image features of N sizes, N being a positive integer not smaller than 2;
    performing cascaded multi-level disparity processing according to the extracted image features to obtain a plurality of disparity maps with successively increasing sizes comprises:
    in the first level of disparity processing of the multi-level disparity processing, generating an initial disparity map having a minimum size according to at least a part of image features of the minimum size among the image features of the N sizes; and
    in each subsequent level of disparity processing of the multi-level disparity processing, performing disparity refinement processing on the disparity map generated by the previous level of disparity processing according to at least a part of image features of a corresponding size among the image features of the N sizes, to generate a refined disparity map having the corresponding size;
    wherein the plurality of disparity maps comprise at least the respective refined disparity maps.
  24. The disparity estimation method of claim 23, wherein
    the multi-level disparity processing comprises N+1 levels of disparity processing;
    in each subsequent level of disparity processing of the multi-level disparity processing, performing disparity refinement processing on the disparity map generated by the previous level of disparity processing according to at least a part of image features of the corresponding size among the image features of the N sizes, to generate a refined disparity map having the corresponding size, wherein the plurality of disparity maps comprise at least the respective refined disparity maps, comprises:
    in the N levels of disparity processing other than the first level of disparity processing, performing disparity refinement processing on the disparity map generated by the previous level of disparity processing, in order of size from small to large, based in turn on at least a part of image features of the corresponding size among the image features of the N sizes, to obtain N refined disparity maps with successively increasing sizes, and taking the N refined disparity maps as the plurality of disparity maps, wherein the sizes of the N refined disparity maps respectively correspond to the N sizes.
  25. The disparity estimation method of claim 23, wherein
    the multi-level disparity processing comprises N levels of disparity processing;
    in each subsequent level of disparity processing of the multi-level disparity processing, performing disparity refinement processing on the disparity map generated by the previous level of disparity processing according to at least a part of image features of the corresponding size among the image features of the N sizes, to generate a refined disparity map having the corresponding size, wherein the plurality of disparity maps comprise at least the respective refined disparity maps, comprises:
    in the N-1 levels of disparity processing other than the first level of disparity processing, performing disparity refinement processing on the disparity map generated by the previous level of disparity processing, in order of size from small to large, based in turn on at least a part of image features of the corresponding size among the N-1 non-minimum sizes of the image features of the N sizes, to obtain N-1 refined disparity maps with successively increasing sizes, and taking the initial disparity map and the N-1 refined disparity maps as the plurality of disparity maps, wherein the sizes of the initial disparity map and the N-1 refined disparity maps respectively correspond to the N sizes.
  26. The disparity estimation method of any one of claims 23-25, wherein
    in each subsequent level of disparity processing of the multi-level disparity processing, performing disparity refinement processing on the disparity map generated by the previous level of disparity processing according to at least a part of image features of the corresponding size among the image features of the N sizes, to generate a refined disparity map having the corresponding size, comprises:
    in each level of disparity processing other than the first level of disparity processing in the multi-level disparity processing, performing residual computation processing on the disparity map generated by the previous level of disparity processing based on at least a part of image features of the corresponding size, to obtain a residual map having the corresponding size, and combining the residual map of the corresponding size with the disparity map generated by the previous level of disparity processing to obtain a refined disparity map having the corresponding size.
  27. The disparity estimation method of any one of claims 23-25, wherein, in each level of disparity processing other than the first level of disparity processing in the multi-level disparity processing, before performing disparity refinement processing on the disparity map generated by the previous level of disparity processing, the method further comprises:
    in response to the size of the disparity map generated by the previous level of disparity processing being smaller than the size corresponding to the current level of disparity processing, upsampling the disparity map generated by the previous level of disparity processing to the size corresponding to the current level of disparity processing.
  28. The disparity estimation method of any one of claims 23-25, wherein the image features of the minimum size among the image features of the N sizes comprise at least one kind of image feature of a first image and at least one kind of image feature of a second image in the image pair, and the image features of each non-minimum size among the image features of the N sizes comprise at least one kind of image feature of the first image and/or at least one kind of image feature of the second image in the image pair.
  29. The disparity estimation method of any one of claims 23-25, wherein the image features on which generation of different refined disparity maps is based are image features of a same kind or of different kinds; and/or
    the image features on which generation of different refined disparity maps is based are image features of a same image or of different images in the image pair.
  30. The disparity estimation method of any one of claims 23-25, wherein the image features on which generation of each refined disparity map is based comprise edge features of at least one image in the image pair, and/or features based on the image itself of at least one image in the image pair.
  31. The disparity estimation method of claim 30, wherein the features based on the image itself of at least one image in the image pair comprise the image itself of the at least one image, or an image obtained by downsampling the image itself of the at least one image according to the size of the refined disparity map to be generated.
  32. The disparity estimation method of claim 20, further comprising:
    computing a loss function for each level of disparity processing of the multi-level disparity processing, the loss function representing an error between disparities in the disparity map generated by that level of disparity processing and corresponding ground-truth disparities.
  33. The disparity estimation method of claim 20, further comprising:
    selecting, according to performance of a target device, a disparity map whose size matches the performance of the target device from the plurality of disparity maps as the disparity map provided to the target device.
  34. The disparity estimation method of claim 20, wherein the image pair is an image pair of a same scene captured by a multiocular camera.
  35. An electronic device, comprising:
    a processor; and
    a memory storing a program, the program comprising instructions that, when executed by the processor, cause the processor to perform the method of any one of claims 20-34.
  36. A computer-readable storage medium storing a program, the program comprising instructions that, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of claims 20-34.
PCT/CN2020/121824 2019-12-13 2020-10-19 视差估计***、方法、电子设备及计算机可读存储介质 WO2021114870A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/127,540 US11158077B2 (en) 2019-12-13 2020-12-18 Disparity estimation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911281475.1A CN112991254A (zh) 2019-12-13 2019-12-13 视差估计***、方法、电子设备及计算机可读存储介质
CN201911281475.1 2019-12-13

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/127,540 Continuation US11158077B2 (en) 2019-12-13 2020-12-18 Disparity estimation

Publications (1)

Publication Number Publication Date
WO2021114870A1 true WO2021114870A1 (zh) 2021-06-17

Family

ID=73789924

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121824 WO2021114870A1 (zh) 2019-12-13 2020-10-19 视差估计***、方法、电子设备及计算机可读存储介质

Country Status (6)

Country Link
US (1) US11158077B2 (zh)
EP (1) EP3836083B1 (zh)
JP (1) JP6902811B2 (zh)
KR (1) KR102289239B1 (zh)
CN (1) CN112991254A (zh)
WO (1) WO2021114870A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11645505B2 (en) * 2020-01-17 2023-05-09 Servicenow Canada Inc. Method and system for generating a vector representation of an image
CN114494087A (zh) * 2020-11-12 2022-05-13 安霸国际有限合伙企业 无监督的多尺度视差/光流融合
CN113808187A (zh) * 2021-09-18 2021-12-17 京东鲲鹏(江苏)科技有限公司 视差图生成方法、装置、电子设备和计算机可读介质
WO2024025851A1 (en) * 2022-07-26 2024-02-01 Becton, Dickinson And Company System and method for estimating object distance and/or angle from an image capture device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472819A (zh) * 2018-09-06 2019-03-15 杭州电子科技大学 一种基于级联几何上下文神经网络的双目视差估计方法
US20190295282A1 (en) * 2018-03-21 2019-09-26 Nvidia Corporation Stereo depth estimation using deep neural networks
CN110427968A (zh) * 2019-06-28 2019-11-08 武汉大学 一种基于细节增强的双目立体匹配方法

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2700654B1 (fr) * 1993-01-19 1995-02-17 Thomson Csf Procédé d'estimation de disparité entre les images monoscopiques constituant une image stéréoscopiques.
US8384763B2 (en) * 2005-07-26 2013-02-26 Her Majesty the Queen in right of Canada as represented by the Minster of Industry, Through the Communications Research Centre Canada Generating a depth map from a two-dimensional source image for stereoscopic and multiview imaging
US7587081B2 (en) * 2005-09-28 2009-09-08 Deere & Company Method for processing stereo vision data using image density
KR100762670B1 (ko) * 2006-06-07 2007-10-01 삼성전자주식회사 스테레오 이미지로부터 디스패리티 맵을 생성하는 방법 및장치와 그를 위한 스테레오 매칭 방법 및 장치
US8385630B2 (en) * 2010-01-05 2013-02-26 Sri International System and method of processing stereo images
US9111342B2 (en) * 2010-07-07 2015-08-18 Electronics And Telecommunications Research Institute Method of time-efficient stereo matching
US8670630B1 (en) * 2010-12-09 2014-03-11 Google Inc. Fast randomized multi-scale energy minimization for image processing
KR101178015B1 (ko) * 2011-08-31 2012-08-30 성균관대학교산학협력단 시차 맵 생성 방법
US8867826B2 (en) 2012-11-26 2014-10-21 Mitsubishi Electric Research Laboratories, Inc. Disparity estimation for misaligned stereo image pairs
JP6192507B2 (ja) 2013-11-20 2017-09-06 キヤノン株式会社 画像処理装置、その制御方法、および制御プログラム、並びに撮像装置
EP2887311B1 (en) * 2013-12-20 2016-09-14 Thomson Licensing Method and apparatus for performing depth estimation
EP2916290A1 (en) * 2014-03-07 2015-09-09 Thomson Licensing Method and apparatus for disparity estimation
US10074158B2 (en) * 2014-07-08 2018-09-11 Qualcomm Incorporated Systems and methods for stereo depth estimation using global minimization and depth interpolation
PL411631A1 (pl) * 2015-03-18 2016-09-26 Politechnika Poznańska System do generowania mapy głębi i sposób generowania mapy głębi
CN108604371A (zh) * 2016-02-25 2018-09-28 深圳市大疆创新科技有限公司 成像***和方法
CN106600583B (zh) 2016-12-07 2019-11-01 西安电子科技大学 基于端到端神经网络的视差图获取方法
KR102459853B1 (ko) * 2017-11-23 2022-10-27 삼성전자주식회사 디스패리티 추정 장치 및 방법
TWI637350B (zh) * 2018-01-09 2018-10-01 緯創資通股份有限公司 產生視差圖的方法及其影像處理裝置與系統
CN108335322B (zh) * 2018-02-01 2021-02-12 深圳市商汤科技有限公司 深度估计方法和装置、电子设备、程序和介质
JP7344660B2 (ja) * 2018-03-30 2023-09-14 キヤノン株式会社 視差算出装置、視差算出方法及び視差算出装置の制御プログラム
US10380753B1 (en) * 2018-05-30 2019-08-13 Aimotive Kft. Method and apparatus for generating a displacement map of an input dataset pair
CN109389078B (zh) * 2018-09-30 2022-06-21 京东方科技集团股份有限公司 图像分割方法、相应的装置及电子设备
US10839543B2 (en) * 2019-02-26 2020-11-17 Baidu Usa Llc Systems and methods for depth estimation using convolutional spatial propagation networks
CN110148179A (zh) 2019-04-19 2019-08-20 北京地平线机器人技术研发有限公司 一种训练用于估计图像视差图的神经网络模型方法、装置及介质

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190295282A1 (en) * 2018-03-21 2019-09-26 Nvidia Corporation Stereo depth estimation using deep neural networks
CN109472819A (zh) * 2018-09-06 2019-03-15 杭州电子科技大学 一种基于级联几何上下文神经网络的双目视差估计方法
CN110427968A (zh) * 2019-06-28 2019-11-08 武汉大学 一种基于细节增强的双目立体匹配方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN CHUNHONG: "Stereo matching algorithm based on multi-scale information and attention", THESIS, 27 August 2019 (2019-08-27), CN, pages 1 - 43, XP009528773 *

Also Published As

Publication number Publication date
EP3836083A1 (en) 2021-06-16
US20210209782A1 (en) 2021-07-08
KR20210076853A (ko) 2021-06-24
JP6902811B2 (ja) 2021-07-14
KR102289239B1 (ko) 2021-08-12
US11158077B2 (en) 2021-10-26
EP3836083B1 (en) 2023-08-09
JP2021096850A (ja) 2021-06-24
CN112991254A (zh) 2021-06-18

Similar Documents

Publication Publication Date Title
US10733431B2 (en) Systems and methods for optimizing pose estimation
US10796452B2 (en) Optimizations for structure mapping and up-sampling
WO2021114870A1 (zh) 视差估计***、方法、电子设备及计算机可读存储介质
US10586350B2 (en) Optimizations for dynamic object instance detection, segmentation, and structure mapping
AU2017324923B2 (en) Predicting depth from image data using a statistical model
Wang et al. Multi-view stereo in the deep learning era: A comprehensive review
US10353271B2 (en) Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
US11640690B2 (en) High resolution neural rendering
EP3493105A1 (en) Optimizations for dynamic object instance detection, segmentation, and structure mapping
EP3493106B1 (en) Optimizations for dynamic object instance detection, segmentation, and structure mapping
US11636570B2 (en) Generating digital images utilizing high-resolution sparse attention and semantic layout manipulation neural networks
US20230154170A1 (en) Method and apparatus with multi-modal feature fusion
US20240119697A1 (en) Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes
US11127115B2 (en) Determination of disparity
WO2019108250A1 (en) Optimizations for dynamic object instance detection, segmentation, and structure mapping
WO2023036157A1 (en) Self-supervised spatiotemporal representation learning by exploring video continuity
WO2022197439A1 (en) High resolution neural rendering
Huang Moving object detection in low-luminance images
WO2021114871A1 (zh) 视差确定方法、电子设备及计算机可读存储介质
Liu et al. Enhancing Point Features with Spatial Information for Point‐Based 3D Object Detection
JP2023082681A (ja) オブジェクト姿勢推定装置及び方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20898340

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20898340

Country of ref document: EP

Kind code of ref document: A1