CN112949504B - Stereo matching method, device, equipment and storage medium - Google Patents

Info

Publication number: CN112949504B
Authority: CN (China)
Legal status: Active (granted)
Application number: CN202110244418.7A
Other versions: CN112949504A (Chinese)
Inventors: 俞正中, 戴齐飞, 艾新东, 赵勇, 李福池
Assignee (original and current): Shenzhen Apical Technology Co., Ltd.

Classifications

    • G06V 20/10 — Scenes; scene-specific elements; terrestrial scenes
    • G06F 18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06N 3/045 — Neural network architectures; combinations of networks
    • G06N 3/084 — Neural network learning methods; backpropagation, e.g. using gradient descent
    • G06V 10/40 — Extraction of image or video features


Abstract

The invention discloses a stereo matching method, apparatus, device and storage medium. The method comprises: acquiring an original image pair captured by a binocular camera, the pair comprising a left image and a right image; extracting a first left feature map and a first right feature map corresponding to the left image and the right image respectively; inputting the first left feature map and the first right feature map into a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map; and obtaining a disparity map from the target left feature map and the target right feature map, thereby achieving more accurate matching while improving matching efficiency.

Description

Stereo matching method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of computer stereo vision, and in particular to a stereo matching method, apparatus, device and storage medium.
Background
With the rapid development of artificial intelligence and computer technology, using machine vision in place of human eyes for measurement and judgment has become an important research topic. Machine vision can improve the flexibility and degree of automation of production lines, and is especially suited to hazardous working environments unfit for manual operation and to settings where human vision cannot meet the requirements. As an important branch of machine vision, binocular stereo vision (Binocular Stereo Vision) offers high efficiency, adequate accuracy, a simple system structure and low cost, and has great application value in many areas such as virtual reality, robot navigation and non-contact measurement.
Binocular stereo vision processes the real world by simulating the human visual system. Its pipeline consists of four main stages: offline camera calibration, which obtains the camera's intrinsic and extrinsic parameters, distortion coefficients, and so on; rectification, which removes the effects of optical distortion and converts the binocular camera to the standard form; stereo matching, which produces a disparity map; and 3D distance calculation, which computes the actual depth of objects from the disparity map.
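The 3D distance stage above uses the standard relation between disparity and depth for a rectified binocular rig, Z = f·B/d. A minimal sketch; the focal length and baseline values are illustrative assumptions, not taken from the patent:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair:
    larger disparity means the object is closer to the camera."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 720 px focal length, 12 cm baseline, 48 px disparity -> 1.8 m
z = depth_from_disparity(48, focal_px=720, baseline_m=0.12)
```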
Stereo matching is both the key and the difficulty of binocular stereo vision, and research on it is currently very active at home and abroad. The input to stereo matching is a pair of rectified left and right images that differ only in the horizontal direction; more specifically, if a point on the actual object images at position (a, b) in the left image, it images at (a', b) in the right image, where a' ≤ a — the same row, shifted only horizontally. Existing convolutional neural network algorithms for stereo matching usually perform poorly at object edges, where mismatches occur, which affects the accuracy and efficiency of stereo matching.
Accordingly, there is a need for improvement and development in the art.
Disclosure of Invention
The invention aims to address the above deficiencies of the prior art by providing a stereo matching method, apparatus, device and storage medium, thereby solving the technical problem that existing convolutional neural network algorithms for stereo matching usually perform poorly at object edges, where mismatches occur, affecting the accuracy and efficiency of stereo matching.
In order to achieve the above purpose, the invention adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a stereo matching method, where the method includes:
acquiring an original image pair captured by a binocular camera, wherein the image pair comprises a left image and a right image;
extracting a first left feature map and a first right feature map corresponding to the left image and the right image respectively;
inputting the first left feature map and the first right feature map into a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map;
and obtaining a disparity map from the target left feature map and the target right feature map.
In a further refinement, the convolutional neural network module comprises a first module and a second module, each of which comprises a first unit, a second unit, a weight calculation unit, a multiplication unit and an addition unit, connected in sequence.
In a further refinement, inputting the first left feature map and the first right feature map into the convolutional neural network module to obtain the target left feature map corresponding to the first left feature map and the target right feature map corresponding to the first right feature map specifically comprises:
inputting the first left feature map and the first right feature map into the first unit to obtain a second left feature map corresponding to the first left feature map and a second right feature map corresponding to the first right feature map;
inputting the second left feature map and the second right feature map into the second unit to obtain a third left feature map corresponding to the second left feature map and a third right feature map corresponding to the second right feature map;
inputting the third left feature map and the third right feature map into the weight calculation unit to obtain a fourth left feature map and a fourth right feature map respectively;
inputting the fourth left feature map and the second left feature map into the multiplication unit to obtain a fifth left feature map;
inputting the fourth right feature map and the second right feature map into the multiplication unit to obtain a fifth right feature map;
inputting the fifth left feature map and the first left feature map into the addition unit to obtain the target left feature map;
and inputting the fifth right feature map and the first right feature map into the addition unit to obtain the target right feature map.
In a further refinement, the first unit comprises two third units, each third unit comprising one 2D convolution layer and one activation function.
In a further refinement, the convolution kernel size of the 2D convolution layer is 3×3, and the activation function is ReLU.
In a further refinement, inputting the third left feature map and the third right feature map into the weight calculation unit to obtain a fourth left feature map and a fourth right feature map respectively specifically comprises:
for each pixel on the third left feature map, searching the pixels of the third right feature map that lie in the same row and within a specified range to find the first pixel with the smallest difference from that pixel, and calculating the weight of the pixel from that first pixel, to obtain the fourth left feature map;
and for each pixel on the third right feature map, searching the pixels of the third left feature map that lie in the same row and within the specified range to find the second pixel with the smallest difference from that pixel, and calculating the weight of the pixel from that second pixel, to obtain the fourth right feature map.
In a further refinement, the weight is calculated as:
w = 1 − sigmoid(M), where M is the minimum absolute difference found for the pixel.
In a second aspect, an embodiment of the present invention provides a stereo matching apparatus, the apparatus including:
the acquisition module, used to acquire an original image pair captured by a binocular camera, wherein the image pair comprises a left image and a right image;
the extraction module, used to extract a first left feature map and a first right feature map corresponding to the left image and the right image respectively;
the data module, used to input the first left feature map and the first right feature map into a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map;
and the matching module, used to obtain a disparity map from the target left feature map and the target right feature map.
In a third aspect, an embodiment of the present invention provides a stereo matching device, the device comprising: a processor, a memory, and a communication bus; the memory stores a computer-readable program executable by the processor;
the communication bus enables connection and communication between the processor and the memory;
when the processor executes the computer-readable program, it implements the steps of the stereo matching method described in any of the above.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing one or more programs executable by one or more processors to implement steps in a stereo matching method as described in any one of the above.
Beneficial effects: compared with the prior art, the invention provides a stereo matching method, apparatus, device and storage medium. The method comprises: acquiring an original image pair captured by a binocular camera, the pair comprising a left image and a right image; extracting a first left feature map and a first right feature map corresponding to the left image and the right image respectively; inputting the first left feature map and the first right feature map into a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map; and obtaining a disparity map from the target left feature map and the target right feature map, thereby achieving more accurate matching while improving matching efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a preferred embodiment of a stereo matching method according to the present invention;
FIG. 2 is a flow chart illustrating the whole implementation process of the stereo matching method according to the present invention;
fig. 3 is a schematic diagram of a convolutional neural network module in the stereo matching method provided by the invention;
FIG. 4 is a flowchart of a preferred embodiment of step S300 in the stereo matching method according to the present invention;
FIG. 5 is a schematic structural diagram of a preferred embodiment of the stereo matching device according to the present invention;
fig. 6 is a schematic structural diagram of a preferred embodiment of the stereo matching apparatus provided by the present invention.
Detailed Description
The invention provides a stereo matching method, apparatus, device and storage medium. To make the purposes, technical solutions and effects of the invention clearer and more definite, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes all or any combination of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The invention will be further described by the description of embodiments with reference to the accompanying drawings.
The embodiment provides a stereo matching method, as shown in fig. 1 and fig. 2, including:
s100, acquiring an original image pair obtained by a binocular camera, wherein the image pair comprises a left image and a right image.
In this embodiment, an original image pair is captured by a binocular camera, where the image pair comprises a left image and a right image. The invention performs the corresponding computation and processing on the left and right images through the following steps to obtain a disparity map.
S200, extracting a first left feature map and a first right feature map corresponding to the left image and the right image respectively.
In the embodiment of the invention, two residual convolutional neural networks are used to extract the image features of the left image and the right image respectively; the two networks have identical structure and share network parameters. Each residual convolutional neural network comprises several convolution layers, each followed by a batch normalization layer and a nonlinear activation layer. After the left and right images are input into the network, they are preprocessed by several convolution layers at the front end, which reduce the height and width of the image to one half each; the convolution layers at the end of the network use dilated (atrous) convolutions.
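The halving of height and width by the front-end convolution layers can be sketched with a single-channel strided convolution. The 3×3 all-ones kernel, stride 2 and padding 1 below are illustrative assumptions; the patent only states that H and W are each reduced to one half:

```python
# Sketch of the front-end preprocessing in S200: a 3x3 convolution with
# stride 2 and zero padding 1, followed by ReLU, halves H and W.
def conv3x3_stride2_relu(x):
    """Single-channel 3x3 conv (averaging kernel), stride 2, pad 1, ReLU."""
    h, w = len(x), len(x[0])
    padded = [[0.0] * (w + 2)] + [[0.0] + row + [0.0] for row in x] + [[0.0] * (w + 2)]
    out = []
    for i in range(0, h, 2):           # stride 2 over rows
        row_out = []
        for j in range(0, w, 2):       # stride 2 over columns
            s = sum(padded[i + di][j + dj] for di in range(3) for dj in range(3))
            row_out.append(max(s / 9.0, 0.0))   # ReLU nonlinearity
        out.append(row_out)
    return out

feat = conv3x3_stride2_relu([[1.0] * 8 for _ in range(6)])  # 6x8 input
# feat is 3x4: both spatial dimensions halved
```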
S300, inputting the first left feature map and the first right feature map into a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map.
In this embodiment, as shown in fig. 3, the convolutional neural network module includes a first module and a second module, where the first module and the second module each include a first unit, a second unit, a weight calculation unit, a multiplication unit, and an addition unit, and the first unit, the second unit, the weight calculation unit, the multiplication unit, and the addition unit are sequentially connected.
Further, inputting the first left feature map and the first right feature map into the convolutional neural network module to obtain the target left feature map and the target right feature map specifically comprises:
inputting the first left feature map and the first right feature map into the first unit to obtain a second left feature map corresponding to the first left feature map and a second right feature map corresponding to the first right feature map;
inputting the second left feature map and the second right feature map into the second unit to obtain a third left feature map corresponding to the second left feature map and a third right feature map corresponding to the second right feature map;
inputting the third left feature map and the third right feature map into the weight calculation unit to obtain a fourth left feature map and a fourth right feature map respectively;
inputting the fourth left feature map and the second left feature map into the multiplication unit to obtain a fifth left feature map;
inputting the fourth right feature map and the second right feature map into the multiplication unit to obtain a fifth right feature map;
inputting the fifth left feature map and the first left feature map into the addition unit to obtain the target left feature map;
and inputting the fifth right feature map and the first right feature map into the addition unit to obtain the target right feature map.
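The data flow through the units can be sketched end-to-end for one branch. This is a hypothetical sketch: 1-D rows stand in for full feature maps, and `unit1`/`unit2` are placeholders for the convolutional first and second units; only the wiring between units follows the description.

```python
import math

def unit1(row):                        # first unit stand-in (conv + ReLU stages)
    return [max(v, 0.0) for v in row]

def unit2(row):                        # second unit stand-in (channel reduction)
    return [max(v, 0.0) for v in row]

def weights(left_row, right_row, search_range=2):
    """Weight-calculation unit: w = 1 - sigmoid(M), where M is the minimum
    absolute difference found in the same row within the search range."""
    out = []
    for i, v in enumerate(left_row):
        lo, hi = max(0, i - search_range), min(len(right_row), i + search_range + 1)
        m = min(abs(v - right_row[j]) for j in range(lo, hi))
        out.append(1.0 - 1.0 / (1.0 + math.exp(-m)))
    return out

def module_forward(first_left, first_right):
    second_l, second_r = unit1(first_left), unit1(first_right)
    third_l, third_r = unit2(second_l), unit2(second_r)
    fourth_l = weights(third_l, third_r)                     # fourth left map
    fifth_l = [w * v for w, v in zip(fourth_l, second_l)]    # multiplication unit
    return [a + b for a, b in zip(fifth_l, first_left)]      # addition unit (residual)

target_left = module_forward([0.2, 0.5, 0.9], [0.2, 0.5, 0.9])
```

With identical left and right rows, every minimum difference M is 0, so every weight is 0.5 and the output is the input scaled by 1.5 via the residual path.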
Specifically, the first unit comprises two third units; each third unit comprises one 2D convolution layer with a 3×3 kernel and 16 output channels, followed by a ReLU activation function. In this embodiment, the first left feature map and the first right feature map are aggregated by these two convolution-plus-activation stages to obtain the second left feature map and the second right feature map. In practical application, the inputs are the left and right feature maps (the first left and first right feature maps) of any intermediate layer of the convolutional neural network's feature extraction stage; their size is C×H×W, where C, H and W denote the number of channels, the height and the width of the feature map respectively. The second left and second right feature maps are then each aggregated by one further 2D convolution layer, and a convolution layer with kernel size 1 and a single output channel reduces the number of channels to 1, yielding two corresponding intermediate feature maps (the third left feature map and the third right feature map). Because the activation function is ReLU, every value on these intermediate feature maps is greater than or equal to 0.
The two intermediate feature maps are used to calculate the weight matrix; since their channel number is 1, the weight matrix is computed much faster from them than it would be from the full multi-channel left and right feature maps.
In practical application, if the convolutional neural network module replaces the first 3 layers of the feature extraction network of PSMNet, the number of output channels can be 32; if it replaces the last 3 layers of the feature extraction network of PSMNet, the number of output channels can be 128. In a preferred embodiment, the convolutional neural network module of the present application replaces the last 6 layers of the feature extraction network of PSMNet, with 128 output channels. It should be noted that the number of output channels may be changed according to actual requirements; the invention is not limited in this respect.
Further, referring to fig. 4, a flowchart of step S300 in the stereo matching method provided by the present invention is shown.
As shown in FIG. 4, inputting the third left feature map and the third right feature map into the weight calculation unit to obtain the fourth left feature map and the fourth right feature map respectively specifically comprises:
S301, for each pixel on the third left feature map, searching the pixels of the third right feature map that lie in the same row and within a specified range to find the first pixel with the smallest difference from that pixel, and calculating the weight of the pixel from that first pixel, to obtain the fourth left feature map;
S302, for each pixel on the third right feature map, searching the pixels of the third left feature map that lie in the same row and within the specified range to find the second pixel with the smallest difference from that pixel, and calculating the weight of the pixel from that second pixel, to obtain the fourth right feature map.
In this embodiment, the weight is calculated as w = 1 − sigmoid(M), where M is the minimum absolute difference found for the pixel.
Specifically, the invention calculates the left and right weight matrices from the two intermediate feature maps (the third left and third right feature maps) as follows: for every point on each intermediate feature map, search the same row of the other intermediate feature map, within the same manually specified range, for the point whose absolute difference from the current point is smallest — that point is the most similar to it — and compute the weight of the current point from this minimum value. The smaller the minimum value, the larger the weight: assuming the minimum value is M, the weight of the point is 1 − sigmoid(M). Optionally, the specified range is initially 192, i.e. the initial search covers 192 pixels, and the range is scaled down in step with the downsampling of the network. The initial range is a hyperparameter and can be adjusted according to actual requirements.
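A small sketch of the weight rule w = 1 − sigmoid(M) and of the shrinking search range. Modelling the range as halving once per 2× downsampling stage is an assumption; the text only says the range follows the network's downsampling:

```python
import math

def weight(m):
    """Weight of a point given M, its minimum same-row absolute difference:
    the smaller M (the better the match), the larger the weight."""
    return 1.0 - 1.0 / (1.0 + math.exp(-m))

def search_range(downsample_stages, initial=192):
    """Assumed behaviour: the specified range halves per 2x downsampling."""
    return initial >> downsample_stages

w_perfect = weight(0.0)   # exact match -> 0.5, the maximum of this rule
w_edge = weight(4.0)      # large difference (e.g. near a parallax edge) -> small
```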
Further, in this embodiment, the fourth left feature map and the second left feature map are input into the multiplication unit, which performs a pixel-wise product of the two to obtain a preliminary output feature map. Since the computed fourth left feature map has the same size as any single channel of the second left feature map, and assuming the second left feature map has C channels, the product operation replicates the fourth left feature map C times along the channel dimension so that the two have exactly the same size, then multiplies the elements at corresponding positions; the result suppresses those features of the input feature map that contain parallax edges. Similarly, the fourth right feature map and the second right feature map are input into the multiplication unit and processed in the same way to obtain the right preliminary output feature map.
Further, the invention obtains the final output feature maps (the target left feature map and the target right feature map) by adding the two preliminary output feature maps (the fifth left and fifth right feature maps) element-wise to the two input feature maps (the first left and first right feature maps) respectively. Since each preliminary output feature map has exactly the same size as its input feature map, this operation directly adds the points at corresponding positions. This preserves the residual structure of a conventional convolutional neural network module, which facilitates backpropagation in deep networks.
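The multiplication and addition units can be sketched together: the single-channel weight map is broadcast across the C channels of the second feature map, multiplied elementwise, and the result is added to the module input (the residual connection). The shapes below are tiny illustrative stand-ins:

```python
def multiply_and_add(weight_map, second_fm, first_fm):
    """weight_map: HxW; second_fm, first_fm: CxHxW (nested lists).
    Replicates the weight map across channels, multiplies elementwise,
    then adds the module input as a residual."""
    out = []
    for c in range(len(second_fm)):                     # broadcast over channels
        chan = []
        for i in range(len(weight_map)):
            chan.append([w * s + f for w, s, f in
                         zip(weight_map[i], second_fm[c][i], first_fm[c][i])])
        out.append(chan)
    return out

wm = [[0.5, 1.0]]                                       # 1x2 weight map
second = [[[2.0, 2.0]], [[4.0, 4.0]]]                   # C=2, H=1, W=2
first = [[[1.0, 1.0]], [[1.0, 1.0]]]
target = multiply_and_add(wm, second, first)
```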
In this embodiment, the pixel-wise addition of the input feature map's original values forms a residual structure whose output is the output of the module. By suppressing features that contain parallax edges, the module improves the performance of the disparity estimation network at object edges. It is built on a common convolutional neural network module; after the common module is replaced with this one, the network loses very little inference speed while gaining considerable accuracy. Common convolutional neural network modules are widely used across tasks; for the stereo matching task, the embodiment of the invention adds the computation and application of a weight matrix at extremely low time cost while leaving the structure of the common module unchanged. On the KITTI 2015 image set, PSMNet using the proposed module improves accuracy by 8.62% while adding only 0.016 s of computation time.
S400, obtaining a disparity map from the target left feature map and the target right feature map.
In this embodiment, a perspective view is finally obtained from the target left feature map and the target right feature map. It should be noted that this step follows the prior art and is not described in detail here.
In summary, compared with the prior art, the embodiment of the invention has the following advantages:
the invention discloses a stereo matching method comprising: acquiring an original image pair captured by a binocular camera, the image pair comprising a left image and a right image; extracting a first left feature map and a first right feature map corresponding to the left image and the right image respectively; inputting the first left feature map and the first right feature map to a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map; and obtaining a perspective view from the target left feature map and the target right feature map, thereby achieving more accurate matching while improving matching efficiency.
Based on the stereo matching method, the invention also provides a stereo matching device, as shown in fig. 5, comprising:
an acquisition module 41 for acquiring an original image pair obtained by a binocular camera, the image pair including a left image and a right image;
an extracting module 42, configured to extract a first left feature map and a first right feature map corresponding to the left image and the right image respectively;
the data module 43 is configured to input the first left feature map and the first right feature map to a convolutional neural network module, respectively, so as to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map;
and the matching module 44 is used for obtaining a perspective view according to the target left feature map and the target right feature map.
It should be noted that those skilled in the art can clearly understand the specific implementation of the stereo matching device and each of its modules by referring to the corresponding description in the foregoing stereo matching method embodiment; for convenience and brevity of description, details are not repeated here.
The above-described stereo matching apparatus may be implemented in the form of a computer program which can be run on a stereo matching device as shown in fig. 6.
Based on the stereo matching method, the invention further provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the steps in the stereo matching method described in the embodiment.
Based on the above stereo matching method, the present invention also provides a stereo matching device, as shown in fig. 6, which includes: at least one processor 20; a display screen 21; and a memory 22, and may further include a communication interface 23 and a bus 24. The processor 20, the display screen 21, the memory 22 and the communication interface 23 may communicate with each other via the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may invoke logic instructions in the memory 22 to perform the methods of the embodiments described above.
Further, the logic instructions in the memory 22 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, as a computer readable storage medium, may be configured to store a software program, a computer executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 performs functional applications and data processing, i.e. implements the methods of the embodiments described above, by running software programs, instructions or modules stored in the memory 22.
The memory 22 may include a storage program area, which may store an operating system and at least one application program required for functions, and a storage data area, which may store data created according to the use of the terminal device. In addition, the memory 22 may include high-speed random access memory and may also include nonvolatile memory. For example, any of a variety of media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, or a transitory storage medium, may be used.
In addition, the specific processes by which the storage medium and the processors in the device load and execute the instructions are described in detail in the method above and are not repeated here.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method of stereo matching, the method comprising:
acquiring an original image pair obtained by a binocular camera, wherein the image pair comprises a left image and a right image;
respectively extracting a first left feature map and a first right feature map which correspond to the left image and the right image respectively;
respectively inputting the first left feature map and the first right feature map to a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map;
obtaining a perspective view according to the target left feature map and the target right feature map;
the convolutional neural network module comprises a first module and a second module, wherein the first module and the second module comprise a first unit, a second unit, a weight calculation unit, a multiplication unit and an addition unit, and the first unit, the second unit, the weight calculation unit, the multiplication unit and the addition unit are sequentially connected;
the step of inputting the first left feature map and the first right feature map to a convolutional neural network module respectively to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map specifically includes:
respectively inputting the first left feature map and the first right feature map to the first unit to obtain a second left feature map corresponding to the first left feature map and a second right feature map corresponding to the first right feature map;
respectively inputting the second left feature map and the second right feature map to the second unit to obtain a third left feature map corresponding to the second left feature map and a third right feature map corresponding to the second right feature map;
inputting the third left feature map and the third right feature map to the weight calculation unit to obtain a fourth left feature map and a fourth right feature map respectively;
inputting the fourth left feature map and the second left feature map to the multiplication unit to obtain a fifth left feature map;
inputting the fourth right feature map and the second right feature map to the multiplication unit to obtain a fifth right feature map;
inputting the fifth left feature map and the first left feature map to the adding unit to obtain a target left feature map;
inputting the fifth right feature map and the first right feature map to the adding unit to obtain a target right feature map;
the inputting the third left feature map and the third right feature map to the weight calculating unit, to obtain a fourth left feature map and a fourth right feature map respectively specifically includes:
for each pixel point on the third left feature map, calculating, among the pixel points on the third right feature map that are in the same row as the pixel point and meet the specified range, a first pixel point with the minimum difference value from the pixel point, calculating the weight of the pixel point according to the first pixel point, and obtaining a fourth left feature map;
and for each pixel point on the third right feature map, calculating, among the pixel points on the third left feature map that are in the same row as the pixel point and meet the specified range, a second pixel point with the minimum difference value from the pixel point, calculating the weight of the pixel point according to the second pixel point, and obtaining a fourth right feature map.
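The per-pixel weight computation of the claim can be sketched as follows. This is a minimal single-channel NumPy illustration; the search window (the claim's unspecified "specified range"), its direction, and all names are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def left_weight_map(third_left, third_right, max_disp=4):
    """For each pixel (y, x) of the third left feature map, scan the same
    row of the third right feature map within the assumed disparity window
    [x - max_disp, x], take the minimum absolute difference M, and set the
    weight to 1 - sigmoid(M) (the formula given in claim 4)."""
    h, w = third_left.shape
    weights = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            lo = max(0, x - max_disp)
            m = np.min(np.abs(third_left[y, x] - third_right[y, lo:x + 1]))
            weights[y, x] = 1.0 - sigmoid(m)
    return weights
```

Where a pixel matches some right-map pixel exactly, M = 0 and the weight is 0.5; where no good match exists in the window, as happens near disparity edges, M is large and the weight approaches 0, suppressing that feature. The fourth right feature map would be obtained symmetrically by scanning the third left feature map.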
2. The stereo matching method according to claim 1, wherein the first unit comprises two third units, each third unit comprising one 2D convolution layer and one activation function.
3. The stereo matching method according to claim 2, wherein the convolution kernel size of the 2D convolution layer is 3 × 3, and the activation function is ReLU.
4. The stereo matching method according to claim 2, wherein the weight is calculated by the formula:
weight = 1 - sigmoid(M), where M is the pixel point with the smallest difference.
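A quick numeric check of the claimed formula, reading M as the minimum difference value found by the row search of claim 1 (one plausible interpretation; the claim's wording is ambiguous): 1 - sigmoid(M) maps a perfect match (M = 0) to a weight of 0.5 and a large minimum difference to a weight near 0, which is what suppresses disparity-edge features.

```python
import math

def edge_weight(m):
    # weight = 1 - sigmoid(M), per claim 4; here M is assumed to be the
    # minimum absolute difference found in the same-row search.
    return 1.0 - 1.0 / (1.0 + math.exp(-m))

# M = 0 (exact match) gives weight 0.5; larger M gives smaller weight.
```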
5. A stereo matching device, the device comprising:
the acquisition module is used for acquiring an original image pair obtained by the binocular camera, wherein the image pair comprises a left image and a right image;
the extraction module is used for respectively extracting a first left characteristic image and a first right characteristic image which are respectively corresponding to the left image and the right image;
the data module is used for respectively inputting the first left characteristic diagram and the first right characteristic diagram to the convolutional neural network module so as to obtain a target left characteristic diagram corresponding to the first left characteristic diagram and a target right characteristic diagram corresponding to the first right characteristic diagram;
the convolutional neural network module comprises a first module and a second module, wherein the first module and the second module comprise a first unit, a second unit, a weight calculation unit, a multiplication unit and an addition unit, and the first unit, the second unit, the weight calculation unit, the multiplication unit and the addition unit are sequentially connected;
the step of inputting the first left feature map and the first right feature map to a convolutional neural network module respectively to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map specifically includes:
respectively inputting the first left feature map and the first right feature map to the first unit to obtain a second left feature map corresponding to the first left feature map and a second right feature map corresponding to the first right feature map;
respectively inputting the second left feature map and the second right feature map to the second unit to obtain a third left feature map corresponding to the second left feature map and a third right feature map corresponding to the second right feature map;
inputting the third left feature map and the third right feature map to the weight calculation unit to obtain a fourth left feature map and a fourth right feature map respectively;
inputting the fourth left feature map and the second left feature map to the multiplication unit to obtain a fifth left feature map;
inputting the fourth right feature map and the second right feature map to the multiplication unit to obtain a fifth right feature map;
inputting the fifth left feature map and the first left feature map to the adding unit to obtain a target left feature map;
inputting the fifth right feature map and the first right feature map to the adding unit to obtain a target right feature map;
the inputting the third left feature map and the third right feature map to the weight calculating unit, to obtain a fourth left feature map and a fourth right feature map respectively specifically includes:
for each pixel point on the third left feature map, calculating, among the pixel points on the third right feature map that are in the same row as the pixel point and meet the specified range, a first pixel point with the minimum difference value from the pixel point, calculating the weight of the pixel point according to the first pixel point, and obtaining a fourth left feature map;
for each pixel point on the third right feature map, calculating, among the pixel points on the third left feature map that are in the same row as the pixel point and meet the specified range, a second pixel point with the minimum difference value from the pixel point, calculating the weight of the pixel point according to the second pixel point, and obtaining a fourth right feature map;
and the matching module is used for obtaining a perspective view according to the target left feature map and the target right feature map.
6. A stereo matching device, the device comprising: a processor and a memory and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus enables connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps of the stereo matching method as defined in any one of claims 1 to 4.
7. A computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the stereo matching method as recited in any one of claims 1-4.
CN202110244418.7A 2021-03-05 2021-03-05 Stereo matching method, device, equipment and storage medium Active CN112949504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110244418.7A CN112949504B (en) 2021-03-05 2021-03-05 Stereo matching method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112949504A CN112949504A (en) 2021-06-11
CN112949504B true CN112949504B (en) 2024-03-19

Family

ID=76247880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110244418.7A Active CN112949504B (en) 2021-03-05 2021-03-05 Stereo matching method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112949504B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117078984B (en) * 2023-10-17 2024-02-02 腾讯科技(深圳)有限公司 Binocular image processing method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522833A (en) * 2018-11-06 2019-03-26 深圳市爱培科技术股份有限公司 A kind of binocular vision solid matching method and system for Road Detection
CN109978936A (en) * 2019-03-28 2019-07-05 腾讯科技(深圳)有限公司 Parallax picture capturing method, device, storage medium and equipment
WO2019184888A1 (en) * 2018-03-28 2019-10-03 华为技术有限公司 Image processing method and apparatus based on convolutional neural network
CN110647930A (en) * 2019-09-20 2020-01-03 北京达佳互联信息技术有限公司 Image processing method and device and electronic equipment
CN111696148A (en) * 2020-06-17 2020-09-22 中国科学技术大学 End-to-end stereo matching method based on convolutional neural network
CN111915660A (en) * 2020-06-28 2020-11-10 华南理工大学 Binocular disparity matching method and system based on shared features and attention up-sampling
CN112150521A (en) * 2020-08-24 2020-12-29 江苏大学 PSmNet optimization-based image stereo matching method
CN112270701A (en) * 2020-10-26 2021-01-26 湖北汽车工业学院 Packet distance network-based parallax prediction method, system and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108335322B (en) * 2018-02-01 2021-02-12 深圳市商汤科技有限公司 Depth estimation method and apparatus, electronic device, program, and medium
KR20210025942A (en) * 2019-08-28 2021-03-10 성균관대학교산학협력단 Method for stereo matching usiing end-to-end convolutional neural network


Also Published As

Publication number Publication date
CN112949504A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN103136750B (en) The Stereo matching optimization method of binocular vision system
CN113077476B (en) Height measurement method, terminal device and computer storage medium
CN112784874B (en) Binocular vision stereo matching method and device, electronic equipment and storage medium
CN110070610B (en) Feature point matching method, and feature point matching method and device in three-dimensional reconstruction process
CN111105452B (en) Binocular vision-based high-low resolution fusion stereo matching method
WO2020237492A1 (en) Three-dimensional reconstruction method, device, apparatus, and storage medium
CN113762358A (en) Semi-supervised learning three-dimensional reconstruction method based on relative deep training
CN112150518B (en) Attention mechanism-based image stereo matching method and binocular device
CN115329111B (en) Image feature library construction method and system based on point cloud and image matching
CN108305281A (en) Calibration method, device, storage medium, program product and the electronic equipment of image
CN112949504B (en) Stereo matching method, device, equipment and storage medium
CN111739071A (en) Rapid iterative registration method, medium, terminal and device based on initial value
CN117237431A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN117315138A (en) Three-dimensional reconstruction method and system based on multi-eye vision
CN112270701A (en) Packet distance network-based parallax prediction method, system and storage medium
CN112381721A (en) Human face three-dimensional reconstruction method based on binocular vision
CN109712230B (en) Three-dimensional model supplementing method and device, storage medium and processor
CN111179325A (en) Binocular depth estimation method and device
CN116385577A (en) Virtual viewpoint image generation method and device
CN115239559A (en) Depth map super-resolution method and system for fusion view synthesis
CN113141495B (en) Image processing method and device, storage medium and electronic device
CN112802079A (en) Disparity map acquisition method, device, terminal and storage medium
CN112561933A (en) Image segmentation method and device
CN112580492A (en) Vehicle detection method and device
CN112541438A (en) Text recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
Denomination of invention: Stereo matching methods, devices, equipment, and storage media
Granted publication date: 20240319
Pledgee: Shenzhen small and medium sized small loan Co.,Ltd.
Pledgor: SHENZHEN APICAL TECHNOLOGY CO.,LTD.
Registration number: Y2024980016140