CN110599489A - Target space positioning method - Google Patents
- Publication number
- CN110599489A (application CN201910792381.4A)
- Authority
- CN
- China
- Prior art keywords
- target
- pixel point
- positioning
- dimensional coordinate
- coordinate set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C11/00—Photogrammetry or videogrammetry, e.g. stereogrammetry; Photographic surveying
- G01C11/04—Interpretation of pictures
- G01C11/06—Interpretation of pictures by comparison of two or more pictures of the same area
- G01C11/08—Interpretation of pictures by comparison of two or more pictures of the same area the pictures not being supported in the same relative position as when they were taken
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a target space positioning method comprising the following steps: simultaneously capture two images of the target from different viewing angles and obtain, through binocular-vision spatial positioning, a set of three-dimensional coordinates for every pixel in one of the images; classify and regress the targets in that image based on instance segmentation to obtain a set of target binary masks; and map and fuse pixel coordinates based on the per-pixel three-dimensional coordinate set and the target binary mask set to obtain the target's three-dimensional coordinates in the image, thereby achieving target spatial positioning. Binocular-vision spatial positioning yields sparse three-dimensional coordinates that describe the target's real scale and spatial position; a deep-learning method then performs monocular instance segmentation on targets of the specific class of interest, precisely defining the semantic attribute of each pixel. Finally, by coupling the three-dimensional coordinates with the instance-segmentation result, the target is spatially positioned through the correspondence of pixel coordinates, so that the sparsely scattered three-dimensional coordinates are densified and positioning accuracy is improved.
Description
Technical Field
The invention belongs to the field of target space positioning, and in particular relates to a target space positioning method.
Background
With the continuous development of production and daily life, the position information of targets draws attention in more and more fields, and target space positioning has wide application in many scenarios, such as danger-zone alarms in factories, obstacle prediction in automatic driving, and position and attitude estimation in aerospace.
The existing means of target space positioning fall mainly into three types: hardware-assisted methods, traditional geometric methods, and depth estimation methods based on deep learning. Hardware-assisted methods locate the target by transmitting/receiving active signals and analyzing them, relying on radio-frequency identification, particle filtering, Wi-Fi, Bluetooth, radar, multi-sensor fusion and the like. Traditional geometric methods acquire the three-dimensional coordinates of the target geometrically, for example by camera calibration or optical-instrument measurement. Depth estimation based on deep learning refers to collecting samples that contain depth information, training a deep-learning network on them, and then estimating depth with the trained network.
However, these three types of methods suffer, respectively, from signal transmission that is easily interfered with or even blocked, from point clouds that are sparse and hard to exploit collectively, and from poor scene portability and an inability to recover real scale information. How to overcome these defects and improve the accuracy of target spatial positioning is therefore a technical problem to be solved in the field.
Disclosure of Invention
The invention provides a target space positioning method to solve the technical problem that existing methods, each relying independently on a single class of technique (a geometric method, a deep-learning method, and so on), cannot overcome their inherent defects and therefore deliver insufficient spatial positioning accuracy.
The technical solution for this problem is as follows: a target space positioning method, comprising:
S1, simultaneously capturing two images of the target from different viewing angles, and obtaining, through binocular-vision spatial positioning, a set of three-dimensional coordinates for every pixel in one of the images;
S2, classifying and regressing the targets in the image based on instance segmentation to obtain a set of target binary masks;
and S3, mapping and fusing pixel coordinates based on the per-pixel three-dimensional coordinate set and the target binary mask set to obtain the target's three-dimensional coordinates in the image, realizing target spatial positioning.
The invention has the following beneficial effects: binocular-vision spatial positioning yields sparse three-dimensional coordinates (actual distances relative to an origin) that describe the target's real scale and spatial position; a deep-learning method performs monocular instance segmentation on targets of the specific class of interest, precisely defining the semantic attribute of each pixel, so the target contour is finer than a bounding rectangle and the large deviation that the many non-target pixels inside a rectangular box can introduce is avoided. Finally, by coupling the three-dimensional coordinates with the instance-segmentation result, the target is spatially positioned through the correspondence of pixel coordinates, densifying the sparsely scattered three-dimensional coordinates and improving the accuracy of target spatial positioning. The invention improves on the spatial positioning of traditional visual methods and, working from real spatial scale in combination with deep learning, can quickly and accurately achieve three-dimensional positioning of a target in a real scene.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, in S1, simultaneously capturing two images of the target from different viewing angles specifically comprises: simultaneously capturing the two images of different viewing angles with a calibrated binocular camera or two calibrated monocular cameras.
The invention has the further beneficial effect that a binocular camera or two monocular cameras maximizes the similarity (including the image size) of the two captured views, improving the accuracy of the three-dimensional coordinates.
Further, in S1, the binocular-vision spatial positioning specifically comprises:
performing image rectification, stereo matching and depth recovery on the two images of different viewing angles using the principles of stereo vision.
Further, in S2, classifying and regressing the targets in the image is specifically:
classifying and regressing the targets in the image with the Mask R-CNN algorithm.
The invention has the further beneficial effect that the Mask R-CNN algorithm performs monocular instance segmentation on targets of the specific class of interest, and the segmented object serves as foreground/background prior information, which helps classify pixels inside and outside the target contour and precisely define the attributes of image pixels.
Further, S2 comprises:
S2.1, classifying and regressing the targets in the image based on instance segmentation to obtain target contours and the set of target binary masks of the regions they delineate;
S2.2, shrinking each target contour with a convolution kernel to obtain new target contours and the set of target binary masks of the regions they delineate.
The invention has the further beneficial effect that, to eliminate the non-target error introduced at the contour edge, pixel erosion is introduced for edge-error optimization, improving target positioning precision.
Further, in S2.2, shrinking the target contour specifically comprises:
scanning all pixels in sequence with a convolution kernel; at each scanned pixel, performing an AND (minimum) operation on the binary mask values of all pixels covered by the kernel and updating the binary mask values of the covered pixels with the result, thereby shrinking the target contour.
The invention has the further beneficial effect that the convolution kernel scans, operates on the elements it covers, and updates the value of each covered element, shrinking the target contour and achieving the erosion effect with little computation, conveniently and quickly.
Further, S3 comprises:
S3.1, according to the target pixel point coordinate set corresponding to the target binary mask set, extracting from the per-pixel three-dimensional coordinate set a target three-dimensional coordinate set corresponding to the target pixel point coordinate set;
and S3.2, performing mean filtering on the target three-dimensional coordinate set to obtain the target's average three-dimensional coordinate, realizing target spatial positioning.
The invention has the further beneficial effect that, whereas per-pixel three-dimensional coordinates obtained by binocular-vision positioning alone carry much noise and are sparse and hard to exploit collectively, fusing in the target pixel point coordinate set obtained by instance segmentation supplies a fine contour and semantic classification of the target, enabling pixel-level optimization of the target and further improving positioning accuracy.
Further, before S3.1, S3 further comprises:
uniformly and randomly sampling the target pixel point coordinate set corresponding to the target binary mask set to obtain a new target pixel point coordinate set, then executing S3.1.
The invention has the further beneficial effect that uniform random sampling saves computing resources and increases positioning speed; at the same time the sampling itself has an error-optimizing effect, greatly reducing edge error and further improving positioning accuracy.
Further, after S3.1 and before S3.2, S3 further comprises:
performing background point filtering on the target three-dimensional coordinate set to obtain a new target three-dimensional coordinate set.
The invention has the further beneficial effect that error optimization by filtering background pixel points greatly reduces edge error and background interference, further improving positioning accuracy.
The invention also provides a storage medium storing instructions which, when read by a computer, cause the computer to execute any one of the above target space positioning methods.
Drawings
Fig. 1 is a flowchart of a target space positioning method according to an embodiment of the present invention;
Fig. 2 is a comparison of pixel erosion effects provided by an embodiment of the present invention;
Fig. 3 is a network framework diagram of target spatial positioning based on fusing the three-dimensional coordinates of all pixels with the target binary mask set, according to an embodiment of the present invention;
Fig. 4 is a flowchart of the coupling process between the three-dimensional coordinates and the target binary mask set according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example one
A method 100 for spatially locating a target, as shown in Fig. 1, comprising:
step 110, simultaneously capturing two images of the target from different viewing angles, and obtaining, through binocular-vision spatial positioning, a set of three-dimensional coordinates for every pixel in one of the images;
step 120, classifying and regressing the targets in the image based on instance segmentation to obtain a set of target binary masks;
and step 130, mapping and fusing pixel coordinates based on the per-pixel three-dimensional coordinate set and the target binary mask set to obtain the target's three-dimensional coordinates in the image, realizing target spatial positioning.
Binocular-vision spatial positioning yields sparse three-dimensional coordinates (actual distances relative to an origin) that describe the target's real scale and spatial position. A deep-learning method performs monocular instance segmentation on targets of the specific class of interest, precisely defining the semantic attribute of each pixel; the resulting target contour is finer than a bounding rectangle, avoiding the large deviation that the many non-target pixels inside a rectangular box can introduce. Finally, the target is spatially positioned by coupling the three-dimensional coordinates with the instance-segmentation result, specifically by fusing the three-dimensional coordinates with the target binary mask set, densifying the sparse three-dimensional coordinates and improving positioning accuracy. This embodiment improves on the spatial positioning of traditional visual methods and can quickly and accurately achieve three-dimensional positioning of a target in a real scene.
Preferably, in step 110, the two images of the target from different viewing angles are captured simultaneously with a calibrated binocular camera or two calibrated monocular cameras.
For example, in an indoor laboratory scene, two Logitech C920 webcams with a resolution of 640 × 480 are used, the target category is designated as pedestrian, and images of the target are captured at the same moment, in the same scene, from different viewing angles.
A binocular camera or two monocular cameras maximizes the similarity (including the image size) of the two captured views, which improves the precision of the three-dimensional coordinates.
Preferably, in step 110, the binocular-vision spatial positioning specifically comprises: performing image rectification, stereo matching and depth recovery on the two images of different viewing angles using the principles of stereo vision.
The stereo matching may use the BM (block matching) method, with the three-dimensional coordinates derived from the disparity.
Based on the captured images, 307200 (640 × 480) groups of data are obtained, each group comprising X, Y and Z coordinate values with the left camera as the origin.
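This binocular stage can be sketched with OpenCV. The following is a minimal illustration only, not the patent's own implementation: it assumes the two views are already rectified and that the 4 × 4 disparity-to-depth matrix Q from a prior cv2.stereoRectify calibration has been saved; the file names are placeholders.

```python
import cv2
import numpy as np

# Load the rectified left/right views (placeholder file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Stereo matching with the BM (block matching) method named above.
# StereoBM returns fixed-point disparity scaled by 16.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# Depth recovery: reproject disparity to one (X, Y, Z) triple per pixel with
# the left camera as origin, i.e. 640 x 480 = 307200 groups of coordinates.
Q = np.load("Q.npy")  # assumed saved from cv2.stereoRectify
xyz = cv2.reprojectImageTo3D(disparity, Q)  # shape (480, 640, 3)
```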
Preferably, in step 120, the classification and regression of the targets in the image is specifically: classifying and regressing the targets with the Mask R-CNN algorithm. Mask R-CNN uses a Region Proposal Network (RPN) to find Regions of Interest (RoIs) in the input image, each containing a candidate target. A pre-trained model classifies and regresses each target, outputting its class and bounding-box position, while a convolutional network generates and outputs the target's binary mask.
For example, the Mask R-CNN algorithm performs instance segmentation on the left image, classifying and regressing to output 640 × 480 binary masks for the pedestrian class.
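As a hedged sketch of this step, the pre-trained Mask R-CNN shipped with torchvision can stand in for the pre-trained model described above; the score and mask thresholds and the COCO "person" label below are illustrative assumptions, not values from the patent.

```python
import torch
import torchvision

# Off-the-shelf Mask R-CNN as a stand-in for the patent's pre-trained model.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = torchvision.io.read_image("left.png").float() / 255.0  # CHW in [0, 1]
with torch.no_grad():
    pred = model([img])[0]  # dict with 'boxes', 'labels', 'scores', 'masks'

# Keep confident detections of the pedestrian class (COCO "person", label 1)
# and threshold the soft masks into 640 x 480 binary masks.
keep = (pred["labels"] == 1) & (pred["scores"] > 0.5)
binary_masks = (pred["masks"][keep, 0] > 0.5).numpy()  # (num_targets, 480, 640)
```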
Preferably, step 120 includes:
step 121, classifying and regressing the targets in the image based on instance segmentation to obtain target contours and the set of target binary masks of the regions they delineate;
and step 122, shrinking each target contour with a convolution kernel to obtain new target contours and the set of target binary masks of the regions they delineate.
To eliminate the non-target error introduced at the contour edge, pixel erosion is introduced for edge-error optimization, improving target positioning precision.
Preferably, in step 122, shrinking the target contour specifically comprises: scanning all pixels in sequence with a convolution kernel; at each scanned pixel, performing an AND (minimum) operation on the binary mask values of all pixels covered by the kernel, updating the binary mask values of the covered pixels with the result, and thereby shrinking the target contour.
For example, to eliminate the non-target error introduced at the edge, pixel erosion is added for edge-error optimization, structured as follows: select an all-ones convolution kernel of size n × n as the erosion factor, and scan the instance-segmentation binary mask with it. Suppose the target binary mask value is 1 and the non-target value is 0; at each scan an AND operation is performed over the covered positions and the minimum is taken as the scan result, so the final value of each pixel is the minimum over the at most n² scan results that pixel undergoes. After the erosion factor has scanned the whole image and all pixel values have been updated, the final instance-segmentation mask is obtained.
Specifically, all-ones convolution kernels of sizes 3 × 3, 5 × 5 and 10 × 10 may be selected as erosion factors to scan the instance-segmentation binary mask; at each scan the operation is performed over the covered positions and the minimum is taken as the scan result. The value of each pixel is then the minimum over the at most 9, 25 or 100 scan results it undergoes, and all pixel values are updated after the full image has been scanned to obtain the final instance target mask, as shown in Fig. 2: (a) shows the target mask without erosion, (b) the 3 × 3 eroded mask, (c) the 5 × 5 eroded mask, and (d) the 10 × 10 eroded mask; as the kernel grows, the target contour shrinks. Thus the larger the erosion factor (i.e., the kernel size), the stronger the pixel erosion and the lower the likelihood of introducing edge errors.
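This scan-and-take-minimum procedure corresponds closely to a standard morphological erosion with an all-ones kernel, so it can be sketched in one call (continuing the binary_masks variable from the segmentation sketch above):

```python
import cv2
import numpy as np

# Edge-error optimization by pixel erosion: an all-ones n x n kernel is the
# erosion factor; cv2.erode replaces each pixel by the minimum over the window.
n = 5                                    # 3, 5 or 10, as compared in Fig. 2
kernel = np.ones((n, n), np.uint8)
mask = binary_masks[0].astype(np.uint8)  # 1 = target, 0 = non-target
eroded = cv2.erode(mask, kernel)         # final instance mask with shrunken contour
```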
Preferably, step 130 includes:
step 131, according to the target pixel point coordinate set corresponding to the target binary mask set, extracting from the per-pixel three-dimensional coordinate set the target three-dimensional coordinate set corresponding to those pixel coordinates;
and step 132, performing mean filtering on the target three-dimensional coordinate set to obtain the target's average three-dimensional coordinate, realizing target spatial positioning.
The overall network framework fusing the three-dimensional coordinates with the instance-segmentation mask results is shown in Fig. 3: the three-dimensional coordinates of all pixels in the image are derived by binocular-vision spatial positioning (the left dashed box in Fig. 3), the final instance target binary mask is obtained through the Mask R-CNN network plus edge-error optimization (e.g. contour erosion), and the target's three-dimensional spatial position is finally generated by fusion calculation (Fig. 4). Note that the stereo image pair in Fig. 3 consists of the two simultaneously captured images from different viewing angles, here divided into a left image and a right image; the left image is used for instance segmentation, a choice that depends on the processing software. Each "masks" label in Fig. 3 denotes the target binary mask set corresponding to a target.
Based on the final instance target binary mask, the mask is first parsed and all pixel values inspected, and the two-dimensional image coordinates of pixels whose value is not 0 are stored (0 denotes the background part of the binary mask, 1 the target part). The three-dimensional coordinates of all pixels, i.e. the coordinates of the space points corresponding to each pixel, are then imported; finally, the three-dimensional coordinates corresponding to the instance target sample points are extracted and mean-filtered to yield the target positioning result.
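Continuing the sketches above (xyz from the binocular stage, eroded as the final instance mask), the coupling and mean filtering reduce to array indexing; the finiteness guard is an added assumption, since block matching leaves unmatched pixels with extreme reprojected values.

```python
import numpy as np

# Store image coordinates of pixels whose mask value is not 0, look up their
# three-dimensional coordinates, and mean-filter them into one target position.
ys, xs = np.nonzero(eroded)        # two-dimensional coordinates of target pixels
target_xyz = xyz[ys, xs]           # (num_target_pixels, 3)

finite = np.all(np.isfinite(target_xyz), axis=1)   # drop invalid reprojections
target_position = target_xyz[finite].mean(axis=0)  # average (X, Y, Z), left-camera origin
```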
Preferably, before step 131, step 130 further comprises: uniformly and randomly sampling the target pixel point coordinate set corresponding to the target binary mask set to obtain a new target pixel point coordinate set, then executing step 131.
To save computing resources and increase positioning speed, the extracted target pixel coordinate set is uniformly and randomly sampled, specifically through a uniform random sampling network. Its input is the instance-segmentation target binary mask set, from which the coordinates of all target pixels are obtained. Let M be the number of target pixel coordinates and N the desired number of sample points: if M ≥ N, the pixels are divided into groups of ⌊M/N⌋ and one pixel is randomly sampled from each group as a sample point, giving N sample points; if M < N, every pixel is a sample point, giving M sample points.
For example, based on the final instance target binary mask set there are 57297 target pixels, and the number of sample points is set to 3000 according to the available computing resources; every 19 pixels therefore form a group and one pixel is randomly sampled from each group, giving 3000 sample points that are uniformly distributed and so reflect the overall spatial position of the target.
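A minimal sketch of the uniform random sampling rule described above (groups of ⌊M/N⌋ pixels, one random pick per group; the pixel coordinates continue from the coupling sketch):

```python
import numpy as np

def uniform_random_sample(coords: np.ndarray, n_samples: int) -> np.ndarray:
    """Draw one random point from each of n_samples consecutive groups."""
    m = len(coords)
    if m < n_samples:        # M < N: every pixel point is a sample point
        return coords
    group = m // n_samples   # e.g. 57297 // 3000 = 19 pixels per group
    idx = [np.random.randint(i * group, (i + 1) * group) for i in range(n_samples)]
    return coords[idx]

pixel_coords = np.stack([ys, xs], axis=1)  # target pixel coordinates from the mask
samples = uniform_random_sample(pixel_coords, 3000)
```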
An error-optimization link is thus designed: introducing uniform random sampling for error optimization greatly reduces edge error and background interference, further improving positioning accuracy.
Preferably, after step 131 and before step 132, step 130 further comprises: performing background-point filtering on the target three-dimensional coordinate set to obtain a new target three-dimensional coordinate set.
To further improve positioning accuracy, the extracted target pixel sample points are filtered for background points, ensuring that few non-target pixel coordinates enter the target three-dimensional coordinate set. The specific structure is a background-filtering network: its input is the uniformly and randomly sampled target pixel sample points (i.e. the new target pixel point coordinate set); the three-dimensional coordinates of all sample points are extracted, the sample-point depths are clustered, and background-class points are filtered out.
For example, the three-dimensional coordinates of the 3000 uniformly and randomly sampled points are extracted and the sample-point depths are clustered. In this embodiment, when the mean depth of one class is more than twice that of the other and its member count is less than one fifth of the other's, the class is judged to be background and filtered out. Here the true depth of the instance target is 50 cm, and the sample-point depths cluster into two classes: class A with a mean depth of 49.21 cm, containing 2977 sample points, and class B with a mean depth of 100000 cm, containing 23 sample points. Class B is therefore judged to be background and filtered out, giving a final positioning result of 49.21 cm. This error-optimization link, filtering background pixels, greatly reduces edge error and background interference and further improves positioning accuracy.
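A sketch of the background filter follows; the patent does not name a clustering algorithm, so a simple two-class split around the mean depth stands in for the clustering step, while the twice-the-mean and one-fifth-count rule matches the embodiment above. Depth units depend on the calibration (centimetres in this example).

```python
import numpy as np

# Depths (Z) of the uniformly sampled points, continuing the sketches above.
depths = xyz[samples[:, 0], samples[:, 1], 2]

# Two-class split as a stand-in for depth clustering.
near = depths[depths <= depths.mean()]
far = depths[depths > depths.mean()]

# Background rule: mean depth more than twice the other class's, and member
# count less than one fifth of the other class's => background, filtered out.
if len(far) > 0 and far.mean() > 2 * near.mean() and len(far) < len(near) / 5:
    depths = near
print(f"positioning depth after filtering: {depths.mean():.2f}")  # e.g. ~49.21
```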
Example two
A storage medium having stored therein instructions which, when read by a computer, cause the computer to execute any of the target space positioning methods described in the first embodiment.
The related technical solution is the same as the first embodiment, and is not described herein again.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A method for spatially locating an object, comprising:
S1, simultaneously capturing two images of the target from different viewing angles, and obtaining, through binocular-vision spatial positioning, a set of three-dimensional coordinates for every pixel in one of the images;
S2, classifying and regressing the targets in the image based on instance segmentation to obtain a set of target binary masks;
and S3, mapping and fusing pixel coordinates based on the per-pixel three-dimensional coordinate set and the target binary mask set to obtain the target's three-dimensional coordinates in the image, realizing target spatial positioning.
2. The method according to claim 1, wherein in S1, simultaneously acquiring two target images from different viewing angles specifically comprises: simultaneously acquiring the two images of different viewing angles with a calibrated binocular camera or two calibrated monocular cameras.
3. The method according to claim 1, wherein in S1, the binocular-vision spatial positioning specifically comprises:
performing image rectification, stereo matching and depth recovery on the two target images of different viewing angles using the principles of stereo vision.
4. The method according to claim 1, wherein in S2, classifying and regressing the targets in the image is specifically:
classifying and regressing the targets in the image with the Mask R-CNN algorithm.
5. The method as claimed in claim 1, wherein said S2 includes:
S2.1, classifying and regressing the targets in the image based on instance segmentation to obtain target contours and the set of target binary masks of the regions they delineate;
S2.2, shrinking each target contour with a convolution kernel to obtain new target contours and the set of target binary masks of the regions they delineate.
6. The method according to claim 5, wherein in S2.2, shrinking the target contour specifically comprises:
scanning all pixels in sequence with a convolution kernel; at each scanned pixel, performing an AND (minimum) operation on the binary mask values of all pixels covered by the kernel and updating the binary mask values of the covered pixels with the result, thereby shrinking the target contour.
7. The method according to any one of claims 1 to 6, wherein said S3 includes:
S3.1, according to the target pixel point coordinate set corresponding to the target binary mask set, extracting from the per-pixel three-dimensional coordinate set a target three-dimensional coordinate set corresponding to the target pixel point coordinate set;
and S3.2, performing mean filtering on the target three-dimensional coordinate set to obtain a target average three-dimensional coordinate, and realizing target space positioning.
8. The method as claimed in claim 7, wherein said S3 further includes, before said S3.1:
and uniformly and randomly sampling a target pixel point coordinate set corresponding to the target binary mask set to obtain a new target pixel point coordinate set, and executing the S3.1.
9. The method as claimed in claim 7, wherein after S3.1 and before S3.2, said S3 further includes:
performing background point filtering on the target three-dimensional coordinate set to obtain a new target three-dimensional coordinate set.
10. A storage medium having stored thereon instructions which, when read by a computer, cause the computer to carry out a method of spatial localization of an object as claimed in any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910792381.4A CN110599489A (en) | 2019-08-26 | 2019-08-26 | Target space positioning method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110599489A true CN110599489A (en) | 2019-12-20 |
Family
ID=68855590
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910792381.4A Pending CN110599489A (en) | 2019-08-26 | 2019-08-26 | Target space positioning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110599489A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107883875A (en) * | 2017-11-23 | 2018-04-06 | 哈尔滨工程大学 | Autonomous type sea cucumber finishing device visual detection positioning device and vision-based detection localization method |
CN108682039A (en) * | 2018-04-28 | 2018-10-19 | 国网山西省电力公司电力科学研究院 | A kind of binocular stereo vision measurement method |
CN108876855A (en) * | 2018-05-28 | 2018-11-23 | 哈尔滨工程大学 | A kind of sea cucumber detection and binocular visual positioning method based on deep learning |
CN109903507A (en) * | 2019-03-04 | 2019-06-18 | 上海海事大学 | A kind of fire disaster intelligent monitor system and method based on deep learning |
CN110008915A (en) * | 2019-04-11 | 2019-07-12 | 电子科技大学 | The system and method for dense human body attitude estimation is carried out based on mask-RCNN |
CN110060299A (en) * | 2019-04-18 | 2019-07-26 | 中国测绘科学研究院 | Danger source identifies and positions method in passway for transmitting electricity based on binocular vision technology |
Non-Patent Citations (3)
Title |
---|
WU Baoshuo et al.: "Research on ranging technology of binocular stereo vision", Video Engineering *
OU Pan et al.: "Target recognition and spatial positioning based on Mask RCNN", Computer Measurement & Control *
MA Jianshe et al.: "Binocular 3D reconstruction based on contour extraction and depth screening", Computer Engineering & Science *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112163502A (en) * | 2020-09-24 | 2021-01-01 | 电子科技大学 | Visual positioning method under indoor dynamic scene |
CN112163502B (en) * | 2020-09-24 | 2022-07-12 | 电子科技大学 | Visual positioning method under indoor dynamic scene |
CN112184797A (en) * | 2020-10-15 | 2021-01-05 | 广州计量检测技术研究院 | Method for spatially positioning key part of kilogram group weight |
CN112184797B (en) * | 2020-10-15 | 2023-04-07 | 广州计量检测技术研究院 | Method for spatially positioning key part of kilogram group weight |
CN112541936B (en) * | 2020-12-09 | 2022-11-08 | 中国科学院自动化研究所 | Method and system for determining visual information of operating space of actuating mechanism |
CN112541936A (en) * | 2020-12-09 | 2021-03-23 | 中国科学院自动化研究所 | Method and system for determining visual information of operating space of actuating mechanism |
CN112884841A (en) * | 2021-04-14 | 2021-06-01 | 哈尔滨工业大学 | Binocular vision positioning method based on semantic target |
CN113267128A (en) * | 2021-05-31 | 2021-08-17 | 西南石油大学 | Binocular vision automatic side slope displacement monitoring method |
CN113298702B (en) * | 2021-06-23 | 2023-08-04 | 重庆科技学院 | Reordering and segmentation method based on large-size image pixel points |
CN113298702A (en) * | 2021-06-23 | 2021-08-24 | 重庆科技学院 | Reordering and dividing method based on large-size image pixel points |
WO2023098487A1 (en) * | 2021-11-30 | 2023-06-08 | 西门子股份公司 | Target detection method and apparatus, electronic device, and computer storage medium |
CN114359411B (en) * | 2022-01-10 | 2022-08-09 | 杭州巨岩欣成科技有限公司 | Method and device for detecting drowning prevention target of swimming pool, computer equipment and storage medium |
CN114359411A (en) * | 2022-01-10 | 2022-04-15 | 杭州巨岩欣成科技有限公司 | Method and device for detecting drowning prevention target of swimming pool, computer equipment and storage medium |
CN116798056A (en) * | 2023-08-28 | 2023-09-22 | 星汉智能科技股份有限公司 | Form image positioning method, apparatus, device and computer readable storage medium |
CN116798056B (en) * | 2023-08-28 | 2023-11-17 | 星汉智能科技股份有限公司 | Form image positioning method, apparatus, device and computer readable storage medium |
CN117541590A (en) * | 2024-01-10 | 2024-02-09 | 腾讯科技(深圳)有限公司 | Image processing method and device, storage medium and electronic equipment |
CN117541590B (en) * | 2024-01-10 | 2024-04-09 | 腾讯科技(深圳)有限公司 | Image processing method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191220 |