CN113298781B - Mars surface three-dimensional terrain detection method based on image and point cloud fusion - Google Patents

Mars surface three-dimensional terrain detection method based on image and point cloud fusion

Info

Publication number
CN113298781B
CN113298781B (application number CN202110565199.2A)
Authority
CN
China
Prior art keywords
dimensional
point cloud
image
mars
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110565199.2A
Other languages
Chinese (zh)
Other versions
CN113298781A (en)
Inventor
高�浩
黄卫
胡海东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202110565199.2A priority Critical patent/CN113298781B/en
Publication of CN113298781A publication Critical patent/CN113298781A/en
Application granted granted Critical
Publication of CN113298781B publication Critical patent/CN113298781B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/0002 Image analysis; Inspection of images, e.g. flaw detection
    • G06F 18/241 Pattern recognition; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/047 Neural networks; Probabilistic or stochastic networks
    • G06N 3/08 Neural networks; Learning methods
    • G06T 7/10 Image analysis; Segmentation; Edge detection
    • G06T 7/80 Image analysis; Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T 2207/10024 Image acquisition modality; Color image
    • G06T 2207/10028 Image acquisition modality; Range image; Depth image; 3D point clouds
    • G06T 2207/20081 Special algorithmic details; Training; Learning
    • G06T 2207/20084 Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/20221 Special algorithmic details; Image combination; Image fusion; Image merging
    • G06T 2207/30204 Subject of image; Context of image processing; Marker

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Mars surface three-dimensional terrain detection method based on image and point cloud fusion, which comprises the following steps: acquiring image data of a region to be detected on the Mars surface; taking the image data of the region to be detected as the input of a trained three-dimensional target detection network; determining the terrain detection result of the region to be detected on the Mars surface according to the output of the three-dimensional target detection network; and visually outputting the terrain detection result. By fusing image and point cloud information through a deep learning method, the invention effectively solves the problem of three-dimensional spatial detection of target terrain on the Mars surface.

Description

Mars surface three-dimensional terrain detection method based on image and point cloud fusion
Technical Field
The invention relates to a Mars surface three-dimensional terrain detection method based on image and point cloud fusion, and belongs to the technical field of Mars intelligent visual three-dimensional target detection and identification.
Background
Knowledge of the landform and topography of Mars is one of the foundations of Mars scientific research, and detecting the composition and distribution of surface material is the key to characterizing that landform and topography. Scientists can obtain a great deal of information simply from the distribution and physical properties of rocks. Therefore, in a rover exploration task, the Mars rover first needs to recognize and assess the environment around it: it acquires information through sensors such as cameras, detects the spatial information and category information of the target terrain, then plans a route to determine the direction of travel, and uses its detection instruments for the next stage of exploration and research.
Compared with 2D object detection, 3D object detection has obvious advantages. To ensure that the Mars rover travels safely and to further survey a specific target terrain, three-dimensional pose information of the terrain is required, whereas a two-dimensional pose in an image carries no depth information. Traditional 2D object detection obtains the category of a target object and a detection box in the image plane; its parameters comprise the category, the center coordinates and the length and width of the detection box, and these parameters describe only the position and size of the target in the image. The absolute scale and position of an object cannot be recovered from a 2D image alone, so collisions cannot be effectively avoided and safety hazards remain. Therefore, 3D detection of Mars surface terrain plays an important role in the rover's exploration, locomotion and path planning, and is of great significance for targeted exploration of the Mars surface.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a Mars surface three-dimensional terrain detection method based on image and point cloud fusion.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
the invention provides a Mars surface three-dimensional terrain detection method based on image and point cloud fusion, which comprises the following steps:
acquiring image data of a region to be detected on the surface of the Mars;
taking the image data of the area to be detected as the input of the trained three-dimensional target detection network;
determining a terrain detection result of a region to be detected on the surface of the Mars according to the output of the three-dimensional target detection network;
visually outputting a terrain detection result;
the image data of the region to be detected on the surface of the Mars are obtained by driving a calibrated camera set to move on the surface of the Mars through a trolley robot, and the calibrating of the camera set comprises calibrating a depth camera by using a calibration plate and obtaining camera internal parameters and camera external parameters;
the camera set comprises a color camera and a depth camera, and the image data comprises an RGB color image and a depth image.
Further, the training of the three-dimensional target detection network includes:
acquiring image data of a Mars simulation field, namely an RGB color image and a depth image;
converting the depth image of the Mars simulation field into a three-dimensional point cloud, and preprocessing the three-dimensional point cloud data;
labeling an RGB color image of a Mars simulation field to obtain a color image data set, and labeling the preprocessed three-dimensional point cloud data to obtain a three-dimensional point cloud data set;
the color image data set is used as the input of a two-dimensional target detection network, the Mars simulation field is detected, and a two-dimensional detection result is output;
taking the two-dimensional detection result and the three-dimensional point cloud data set as input, and extracting conical point cloud of a target area in the three-dimensional point cloud;
using the conical point cloud as the input of a three-dimensional target detection network, detecting the Mars simulation field and outputting a three-dimensional detection result;
and performing iterative training on the three-dimensional target detection network based on the three-dimensional detection result.
The image data of the surface of the Mars simulation field is obtained by driving a calibrated camera set to move in the Mars simulation field through a trolley robot.
Further, the acquiring of the image data includes:
adjusting the acquisition angle and the illumination environment of the camera set; and controlling the trolley robot to move by using a ros system according to preset parameters and routes, recording the terrain environment by the camera group to obtain a bag file, transmitting and storing the bag file, and analyzing the bag file according to the timestamp to obtain an RGB (red, green and blue) color image and a depth image of each frame.
Further, the converting the depth image into a three-dimensional point cloud comprises:
the camera internal reference is used as the constraint adjustment of coordinate transformation to convert the depth image into three-dimensional point cloud, and the formula is as follows:
Figure BDA0003080447250000031
wherein x, y, z are three-dimensional point cloud coordinates, x 'and y' are depth image coordinates, f x And f y Is the focal length in the camera parameters, and D is the depth value of the depth image.
Further, the three-dimensional point cloud coordinates are saved in the pcd point cloud format and as binary bin files.
Further, the taking of the color image data set as the input of the two-dimensional target detection network and outputting a two-dimensional detection result includes:
the color image data set is input into a two-dimensional target detection network Yolov5 to be trained to obtain weights, the trained weight files are used for predicting and identifying the target area to obtain the category and the position of the target area in the two-dimensional image, and the category and the position are output as two-dimensional detection results.
Further, the extracting the cone-shaped point cloud of the target area in the three-dimensional point cloud by using the two-dimensional detection result and the three-dimensional point cloud data set as input includes:
converting a three-dimensional point cloud data set under a depth camera coordinate system into a two-dimensional image under a color camera coordinate system based on coordinate conversion of the depth camera and the color camera;
taking the position in the two-dimensional detection result as a target area in the two-dimensional image, and converting the two-dimensional image in the target area into a three-dimensional point cloud data set, namely a cone-shaped point cloud, in the target area based on the coordinate conversion of the depth camera and the color camera;
the formula for converting the coordinates of the depth camera and the color camera is as follows:
P_rgb = R · P_ir + T
where P_rgb and P_ir are the projected coordinate points under the color camera and the depth camera respectively, and R and T are the rotation matrix and the translation matrix of the camera extrinsic parameters.
Further, the preprocessing of the three-dimensional point cloud data comprises format conversion and data normalization;
the marking of the RGB color image of the mars simulation field comprises marking operation by using label img marking software, wherein the marking comprises positions and types;
and the marking of the preprocessed three-dimensional point cloud data comprises marking operation by adopting a ros system, wherein the marking comprises a three-dimensional boundary bounding box and a spatial position.
Further, the step of using the cone-shaped point cloud as an input of a three-dimensional target detection network, detecting the mars simulation site and outputting a three-dimensional detection result comprises:
extracting the characteristics of the conical point cloud by adopting an improved PointSIFT network, thereby performing three-dimensional semantic segmentation;
performing feature extraction and prediction on the segmented object point cloud by using a T-Net sub-network so as to obtain the spatial position of the terrain of a target area;
performing network classification and regression on the object point cloud obtained by segmentation by using a box regression subnetwork, thereby obtaining a three-dimensional boundary bounding box of the target area terrain;
and outputting the space position and the three-dimensional boundary bounding box as a three-dimensional detection result.
Further, the size of the RGB color image is 1920x1080 pixels with three RGB channels; the size of the depth image is 512x424 pixels, each pixel being 16 bits (2 bytes) and storing the depth value, i.e. the actual distance, in millimeters.
Compared with the prior art, the invention has the following beneficial effects:
the Mars surface three-dimensional terrain detection method based on image and point cloud fusion comprises sub-modules of camera calibration, data acquisition, data preprocessing, data annotation, two-dimensional image terrain detection and three-dimensional terrain detection, result visualization and the like, image and point cloud information are fused, a neural network is used for being faster, more accurate and more stable than a traditional method, and the automation degree and the accuracy of terrain detection are improved; according to the method, the image and the point cloud information are fused by a deep learning method and applied to a Mars three-dimensional terrain detection scene, an original semantic segmentation network in a Frustum PointNets network is replaced, point cloud semantic segmentation is more accurately realized by using an improved PointSIFT-based network, meanwhile, the network is simplified, the detection speed is improved, the T-Net network and a box estimation sub-network are reserved to realize target space positioning and bounding box parameter regression, the space information of the target terrain is finally obtained and the result is visualized, and the problems of Mars surface terrain detection and positioning in the detection process of a Mars vehicle are effectively solved.
Drawings
Fig. 1 is a flowchart of a method for detecting a three-dimensional terrain on a surface of a mars according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a data acquisition device according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a three-dimensional target detection network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an improved PointSIFT network of the present invention;
FIG. 5 is a diagram of a three-dimensional object detection network architecture in accordance with the present invention;
labeled in the figure as:
1. steering wheel; 2. motor control system; 3. camera set; 4. laser radar.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The embodiment provides a mars surface three-dimensional terrain detection method based on image and point cloud fusion, as shown in fig. 1, the method comprises the following steps:
step 101, acquiring image data of a region to be detected on the surface of a Mars;
step 102, taking image data of a region to be detected as input of a trained three-dimensional target detection network;
Step 103, determining a terrain detection result of the region to be detected on the Mars surface according to the output of the three-dimensional target detection network;
Step 104, visually outputting the terrain detection result: after the spatial position and the three-dimensional bounding box of the Mars terrain are obtained, the prediction result is visualized in the two-dimensional image and in the three-dimensional point cloud respectively.
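As an illustration of this visualization step, the sketch below draws a predicted 2D box on the color image with OpenCV and renders the point cloud together with a predicted 3D bounding box as an Open3D line set. It is a minimal sketch rather than the patent's implementation; file names, the example box values and the class label are assumptions.

```python
# Hedged visualization sketch: OpenCV for the 2D result, Open3D for the 3D result.
# All file names and numeric box values below are illustrative placeholders.
import cv2
import numpy as np
import open3d as o3d

# --- 2D: draw the predicted class and box on the RGB image ---
img = cv2.imread("frame_000123_rgb.png")                  # hypothetical frame
x1, y1, x2, y2, label = 850, 400, 1100, 620, "rock"       # example prediction
cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
cv2.putText(img, label, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
cv2.imwrite("frame_000123_det2d.png", img)

# --- 3D: show the point cloud with the predicted 3D bounding box ---
points = np.fromfile("frame_000123.bin", dtype=np.float32).reshape(-1, 3)
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points.astype(np.float64))

center = np.array([1.2, 0.1, 3.5])       # predicted box centre (cx, cy, cz), example
extent = np.array([0.4, 0.6, 0.5])       # illustrative box extents along x, y, z
# Axis-aligned example box; the real box may additionally carry a heading angle.
corners = np.array([center + extent * np.array(s) / 2.0
                    for s in [(-1, -1, -1), (1, -1, -1), (1, 1, -1), (-1, 1, -1),
                              (-1, -1, 1), (1, -1, 1), (1, 1, 1), (-1, 1, 1)]])
edges = [[0, 1], [1, 2], [2, 3], [3, 0], [4, 5], [5, 6], [6, 7], [7, 4],
         [0, 4], [1, 5], [2, 6], [3, 7]]
box = o3d.geometry.LineSet()
box.points = o3d.utility.Vector3dVector(corners)
box.lines = o3d.utility.Vector2iVector(edges)
box.paint_uniform_color([1.0, 0.0, 0.0])
o3d.visualization.draw_geometries([pcd, box])
```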
As shown in fig. 2, the image data of the region to be detected on the Mars surface are obtained by a trolley robot carrying a calibrated camera set and moving over the Mars surface; calibrating the camera set comprises calibrating the depth camera with a calibration plate and obtaining the camera intrinsic and extrinsic parameters. The camera set includes a color camera and a depth camera, and the image data include RGB color images and depth images. The RGB color image is 1920x1080 pixels with three RGB channels; the depth image is 512x424 pixels, each pixel being 16 bits (2 bytes) and storing the depth value, i.e. the actual distance, in millimeters.
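For the calibration step described above, the following sketch shows one standard way to estimate the camera intrinsic parameters from images of a chessboard-style calibration plate with OpenCV; the board dimensions, square size and image folder are assumptions, and the extrinsics between the depth and color cameras would be obtained by an analogous stereo calibration.

```python
# Hedged calibration sketch: chessboard corners -> camera intrinsics with OpenCV.
# Board geometry (9x6 inner corners, 25 mm squares) and the image glob are assumed.
import glob
import cv2
import numpy as np

pattern = (9, 6)                      # inner corners per row / column (assumed)
square = 0.025                        # square size in metres (assumed)

# 3D coordinates of the board corners in the board's own plane (Z = 0).
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):                 # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern, None)
    if not found:
        continue
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_points.append(objp)
    img_points.append(corners)

# K holds fx, fy, cx, cy used later for the depth-to-point-cloud conversion.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms)
print("intrinsic matrix K:\n", K)
```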
The data acquisition device calibrates the camera set 3 with the calibration plate, obtains the camera intrinsic and extrinsic parameters, and adjusts the acquisition angle and illumination environment of the camera set. The camera set 3 is carried by the trolley robot to collect raw data in the Mars simulation field; the cart mainly comprises a steering wheel 1, a chassis, a motor control system 2, a laser radar 4 and the like. During data collection, hardware such as the motor control system 2 and the laser radar 4 is used, the ROS system is used to control and move the trolley robot, the camera set records bag files which are transmitted and stored, and the bag files are parsed by timestamp to complete the preliminary data collection.
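A minimal sketch of the bag-parsing step is given below, assuming the ROS 1 rosbag and cv_bridge Python APIs; the topic names, bag file name and output layout are assumptions. RGB and depth frames recorded close in time can afterwards be paired by nearest timestamp before conversion to point clouds.

```python
# Hedged sketch: extract per-frame RGB and depth images from a recorded bag file.
# Topic and file names are hypothetical, not taken from the patent.
import os
import rosbag
from cv_bridge import CvBridge
import cv2

RGB_TOPIC = "/kinect2/hd/image_color"      # hypothetical color topic
DEPTH_TOPIC = "/kinect2/sd/image_depth"    # hypothetical depth topic

os.makedirs("rgb", exist_ok=True)
os.makedirs("depth", exist_ok=True)

bridge = CvBridge()
bag = rosbag.Bag("mars_field.bag")         # hypothetical recorded bag file

for topic, msg, stamp in bag.read_messages(topics=[RGB_TOPIC, DEPTH_TOPIC]):
    t = str(stamp.to_nsec())               # timestamp used as the frame key
    if topic == RGB_TOPIC:
        img = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        cv2.imwrite(f"rgb/{t}.png", img)
    else:
        # 16-bit depth in millimetres, saved losslessly as a 16-bit PNG.
        depth = bridge.imgmsg_to_cv2(msg, desired_encoding="16UC1")
        cv2.imwrite(f"depth/{t}.png", depth)

bag.close()
```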
As can be seen from the above steps, the three-dimensional target detection network needs to be trained first, as shown in fig. 3, the method includes the following steps:
step 201, acquiring image data of a mars simulation field, namely an RGB (red, green and blue) color image and a depth image; in the embodiment, a Mars simulation field of a certain space yard is used for collecting a simulated Mars terrain, and the acquisition of the image data of the Mars simulation field is the same as the acquisition principle of the image data of the region to be detected on the surface of the Mars.
Step 202, converting the depth image of the mars simulation field into a three-dimensional point cloud, and performing preprocessing operation on the three-dimensional point cloud data, wherein the preprocessing comprises format conversion and data normalization;
the conversion of the depth image into a three-dimensional point cloud comprises: the camera internal reference is used as the constraint adjustment of coordinate transformation to convert the depth image into three-dimensional point cloud, and the formula is as follows:
Figure BDA0003080447250000071
where x, y, z are three-dimensional point cloud coordinates, x 'and y' are depth image coordinates, f x And f y The three-dimensional point cloud coordinate is stored in a pcd point cloud format and a two-level bin file mode.
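The conversion above can be written in a few lines of NumPy. The sketch below assumes a 16-bit depth image in millimetres and illustrative intrinsic values, and saves the result both as a .pcd file (via Open3D) and as a raw binary .bin file, mirroring the two storage formats mentioned above.

```python
# Hedged sketch of depth-image back-projection using the formula above.
# Intrinsics (fx, fy, cx, cy) and the input file name are placeholder values.
import numpy as np
import cv2
import open3d as o3d

fx, fy = 365.0, 365.0          # focal lengths of the depth camera (assumed)
cx, cy = 256.0, 212.0          # principal point of the depth camera (assumed)

depth = cv2.imread("depth/000123.png", cv2.IMREAD_UNCHANGED).astype(np.float32)
depth_m = depth / 1000.0       # millimetres -> metres

v, u = np.indices(depth.shape)           # u = x' (column index), v = y' (row index)
z = depth_m
x = (u - cx) * z / fx
y = (v - cy) * z / fy

points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
points = points[points[:, 2] > 0]        # drop pixels with no depth reading

# Save as pcd and as a flat binary file.
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points.astype(np.float64))
o3d.io.write_point_cloud("cloud_000123.pcd", pcd)
points.astype(np.float32).tofile("cloud_000123.bin")
```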
Step 203, labeling the RGB color image of the mars simulation field by using labelImg labeling software to obtain a color image data set, wherein the labeling comprises positions and categories, and labeling the preprocessed three-dimensional point cloud data by using a ros system to obtain a three-dimensional point cloud data set, wherein the labeling comprises a three-dimensional boundary bounding box and a space position.
Step 204, taking the color image data set as the input of the two-dimensional target detection network, detecting the Mars simulation field and outputting a two-dimensional detection result; specifically, the color image data set is input into the two-dimensional target detection network Yolov5 for training to obtain weights, the trained weight file is used to predict and identify the target area, the category and position of the target area in the two-dimensional image are obtained, and the category and position are output as the two-dimensional detection result.
Yolov5 still uses the Multi-Scale Training idea from Yolov2, fine-tuning the input size of the network every few iterations. It uses Darknet-53 with the fully connected layer removed, keeping the front 52 layers. The reason is that this network structure achieves good classification results on ImageNet, which shows that it learns good features. Compared with ResNet-152 and ResNet-101, Darknet-53 shows little difference in classification accuracy while being faster to compute and having a more concise structure. Darknet-53 adopts the skip-connection scheme of residual networks, and it outperforms the deeper ResNet-152 and ResNet-101 because its basic units differ: with fewer layers it needs fewer parameters and less computation. For the class prediction, Yolov5 replaces the Softmax function with a Logistic function, so that the classes are predicted independently of each other and the categories are decoupled. For the position prediction, Yolov5 does not predict the absolute coordinates of the bounding box center directly; instead it predicts an offset relative to the upper-left corner of the grid cell containing the target.
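As an illustration of this 2D detection step, the sketch below runs YOLOv5 inference through the public torch.hub interface; the pretrained model is used only as a stand-in, since the patent's weights trained on the labelled Mars-simulation RGB data set are not available, and the frame path is hypothetical.

```python
# Hedged sketch: 2D terrain detection with YOLOv5 via torch.hub.
# A weight file trained on the Mars data would be loaded instead with e.g.
# torch.hub.load("ultralytics/yolov5", "custom", path="mars_terrain.pt") (hypothetical).
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s")   # stand-in pretrained weights
model.conf = 0.4                                          # confidence threshold

results = model("rgb/000123.png")        # hypothetical frame from the data set
detections = results.xyxy[0]             # tensor rows: x1, y1, x2, y2, conf, class

for x1, y1, x2, y2, conf, cls in detections.tolist():
    print(f"class={model.names[int(cls)]} conf={conf:.2f} "
          f"box=({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})")
```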
Step 205, taking the two-dimensional detection result and the three-dimensional point cloud data set as input and extracting the cone-shaped point cloud of the target area from the three-dimensional point cloud, which specifically comprises the following steps:
converting a three-dimensional point cloud data set under a depth camera coordinate system into a two-dimensional image under a color camera coordinate system based on coordinate conversion of the depth camera and the color camera;
taking the position in the two-dimensional detection result as a target area in the two-dimensional image, and converting the two-dimensional image in the target area into a three-dimensional point cloud data set, namely a cone-shaped point cloud, in the target area based on the coordinate conversion of the depth camera and the color camera;
the formula for converting the coordinates of the depth camera and the color camera is as follows:
P_rgb = R · P_ir + T
where P_rgb and P_ir are the projected coordinate points under the color camera and the depth camera respectively, and R and T are the rotation matrix and the translation matrix of the camera extrinsic parameters.
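A sketch of this coordinate conversion is given below: points in the depth-camera frame are mapped into the color-camera frame with the extrinsic R and T, then projected with the color-camera intrinsics to find their pixel locations in the RGB image. All numeric parameter values are placeholders, not values from the patent.

```python
# Hedged sketch of P_rgb = R * P_ir + T followed by pinhole projection into
# the color image. R, T and the color intrinsics below are assumed values.
import numpy as np

R = np.eye(3)                                   # rotation depth -> color (assumed)
T = np.array([0.052, 0.0, 0.0])                 # translation in metres (assumed)
fx_c, fy_c, cx_c, cy_c = 1060.0, 1060.0, 960.0, 540.0   # color intrinsics (assumed)

points_ir = np.fromfile("cloud_000123.bin", dtype=np.float32).reshape(-1, 3)

points_rgb = points_ir @ R.T + T                # P_rgb = R * P_ir + T, row-wise
z = points_rgb[:, 2]
u = fx_c * points_rgb[:, 0] / z + cx_c          # pixel column in the RGB image
v = fy_c * points_rgb[:, 1] / z + cy_c          # pixel row in the RGB image

# Keep only points that actually land inside the 1920x1080 color image.
valid = (z > 0) & (u >= 0) & (u < 1920) & (v >= 0) & (v < 1080)
print("projected points inside the color image:", int(valid.sum()))
```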
The center pixel position and the size of the candidate bounding box of the target-area terrain on the two-dimensional image are given by:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where (b_x, b_y) are the coordinates of the center point of the bounding box to be predicted, b_w is the width of the bounding box and b_h is its height; (t_x, t_y) are the predicted coordinate offsets of the bounding box center, which are passed through the sigmoid function σ to give offsets between 0 and 1 and are added to c_x and c_y to obtain the X-axis and Y-axis coordinates of the bounding box center; (c_x, c_y) are the coordinates of the upper-left corner of the grid cell in which the center falls; t_w and t_h are scale factors; p_w and p_h are the width and height of the preset candidate box mapped onto the feature map, i.e. the manually set anchor box width and height; t_w acting on p_w gives the bounding box width b_w, and t_h acting on p_h gives the bounding box height b_h. The coordinates of the top-left and bottom-right vertices of the candidate box are then obtained by the following formulas.
x_1 = b_x - b_w / 2
y_1 = b_y - b_h / 2
x_2 = b_x + b_w / 2
y_2 = b_y + b_h / 2
where (x_1, y_1) and (x_2, y_2) are the top-left and bottom-right vertices of the candidate box, respectively.
Knowing the projection matrix of the camera, the two-dimensional bounding box can thus be lifted to a frustum (cone) that defines the three-dimensional search space for the object.
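Putting the last two steps together, the sketch below first decodes a predicted box with the formulas above and then keeps only the points whose color-image projections (the u, v values from the previous sketch) fall inside the pixel box, which yields the cone-shaped point cloud passed to the 3D network. All offsets, anchor sizes, pixel coordinates and file names are illustrative assumptions.

```python
# Hedged sketch: YOLO-style box decoding followed by frustum point extraction.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# --- decode one predicted box with the formulas above (example numbers) ---
tx, ty, tw, th = 0.3, -0.1, 0.2, 0.4      # raw network outputs (illustrative)
cx_cell, cy_cell = 12.0, 7.0              # grid-cell top-left corner (illustrative)
pw, ph = 3.5, 2.8                         # preset anchor width / height (illustrative)

bx, by = sigmoid(tx) + cx_cell, sigmoid(ty) + cy_cell
bw, bh = pw * np.exp(tw), ph * np.exp(th)
print("decoded box corners:", bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2)

# --- frustum extraction: keep points whose image projections fall in the box ---
# (x1, y1, x2, y2) would be the detector's box in pixel coordinates; u, v and
# points_rgb come from the projection sketch above (assumed intermediate files).
x1, y1, x2, y2 = 850.0, 400.0, 1100.0, 620.0
points_rgb = np.load("points_rgb.npy")    # (N, 3) points in the color-camera frame
u, v = np.load("u.npy"), np.load("v.npy") # (N,) projected pixel coordinates

in_box = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2) & (points_rgb[:, 2] > 0)
frustum_points = points_rgb[in_box]       # cone-shaped point cloud for the 3D network
print("frustum points:", frustum_points.shape[0])
```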
Step 206, using the cone-shaped point cloud as an input of a three-dimensional target detection network, detecting the mars simulation site and outputting a three-dimensional detection result, specifically comprising:
extracting the characteristics of the conical point cloud by adopting an improved PointSIFT network, thereby performing three-dimensional semantic segmentation;
performing feature extraction and prediction on the segmented object point cloud by using a T-Net sub-network so as to obtain the spatial position of the terrain of the target area;
performing network classification and regression on the object point cloud obtained by segmentation by using a box regression subnetwork, thereby obtaining a three-dimensional boundary bounding box of the target area terrain;
and outputting the space position and the three-dimensional boundary bounding box as a three-dimensional detection result.
As shown in FIG. 4, the result of the semantic segmentation network in the F-PointNet framework directly affects the precision of localization and bounding-box regression. Therefore, point cloud semantic segmentation is performed with a PointSIFT-based sub-network, which learns the target point cloud and its spatial features better and improves point cloud classification accuracy. The advantage of PointSIFT is that it takes both the orientation encoding (OE) and the scale awareness of the point cloud into account. Orientation encoding lets each point perceive information in different directions and fuse the features of the surrounding points. Scale awareness lets the network keep adjusting its weight parameters during learning so as to learn the scale best suited for extracting point cloud features. The basic module of PointSIFT is the orientation-encoding unit, which extracts features in 8 directions (up, down, left, right, up-left, down-left, up-right, down-right). By stacking multiple orientation-encoding units, OE units on different layers perceive information at different scales, which provides the scale-awareness capability. Meanwhile, the network uses the SA (set abstraction) and FP (feature propagation) modules of the PointNet++ network, whose main tasks are down-sampling and up-sampling respectively. Finally, the output of the decoder is connected to the fully connected layer to predict the probability of each class.
Meanwhile, because the input is the frustum obtained in the previous step, the number of points fed into the semantic segmentation network is greatly reduced; to increase the overall inference speed the network is further simplified, and a model compression method markedly improves both the speed and the precision of the semantic segmentation network. Specifically, the raw point cloud is used as the network input: an n x D matrix is given as input, describing a point set of size n in which each point has D-dimensional features; only the X, Y, Z coordinates of the 3D points are considered, so D = 3. In fig. 3, the numbers below the layers give the shape of the corresponding output point set; for example, a first 1024x64 layer means 1024 points, each with 64 feature channels. The output of the PointSIFT module is then taken as the input of the down-sampling stage of the SA module to extract higher-dimensional features. In the up-sampling stage, the FP module serves as the decoder, and its output is fed to PointSIFT so that the network learns feature information in different directions and at different scales. Finally, the output of the decoder is connected to the fully connected layer to predict the probabilities of the classes.
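The exact layer configuration of the improved network is defined by the patent's figures; purely as a structural sketch of the module ordering described above (PointSIFT, SA, PointSIFT, SA, FP, PointSIFT, FP, PointSIFT, fully connected layer), the PyTorch code below uses simplified stand-ins for the orientation-encoding, set-abstraction and feature-propagation blocks. It reproduces the data flow and tensor shapes only; the stand-in operators and all channel sizes are assumptions, not the original implementation.

```python
# Structural sketch of the improved PointSIFT-style segmentation backbone.
# OE/SA/FP blocks are simplified stand-ins (pointwise MLPs, naive down/up-sampling).
import torch
import torch.nn as nn

class PointwiseMLP(nn.Module):
    """Shared MLP applied to every point: (B, N, C_in) -> (B, N, C_out)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(c_in, c_out), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class PointSIFTBlock(nn.Module):
    """Stand-in for an orientation-encoding unit: here just a pointwise MLP."""
    def __init__(self, c):
        super().__init__()
        self.mlp = PointwiseMLP(c, c)
    def forward(self, xyz, feat):
        return self.mlp(feat)

class SADown(nn.Module):
    """Stand-in set abstraction: keep the first N/4 points and lift the channels."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.mlp = PointwiseMLP(c_in, c_out)
    def forward(self, xyz, feat):
        n = xyz.shape[1] // 4
        return xyz[:, :n], self.mlp(feat[:, :n])

class FPUp(nn.Module):
    """Stand-in feature propagation: nearest-neighbour upsampling plus an MLP."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.mlp = PointwiseMLP(c_in, c_out)
    def forward(self, xyz_dense, xyz_sparse, feat_sparse):
        d = torch.cdist(xyz_dense, xyz_sparse)          # (B, Nd, Ns) distances
        idx = d.argmin(dim=-1)                          # nearest sparse point per dense point
        up = torch.gather(feat_sparse, 1,
                          idx.unsqueeze(-1).expand(-1, -1, feat_sparse.shape[-1]))
        return self.mlp(up)

class FrustumSegNet(nn.Module):
    """PointSIFT -> SA -> PointSIFT -> SA -> FP -> PointSIFT -> FP -> PointSIFT -> FC."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.embed = PointwiseMLP(3, 64)
        self.sift1 = PointSIFTBlock(64)
        self.sa1   = SADown(64, 128)
        self.sift2 = PointSIFTBlock(128)
        self.sa2   = SADown(128, 256)
        self.fp1   = FPUp(256, 128)
        self.sift3 = PointSIFTBlock(128)
        self.fp2   = FPUp(128, 64)
        self.sift4 = PointSIFTBlock(64)
        self.head  = nn.Linear(64, num_classes)        # per-point class logits
    def forward(self, xyz):                            # xyz: (B, N, 3) frustum points
        f0 = self.sift1(xyz, self.embed(xyz))
        xyz1, f1 = self.sa1(xyz, f0)
        f1 = self.sift2(xyz1, f1)
        xyz2, f2 = self.sa2(xyz1, f1)
        u1 = self.sift3(xyz1, self.fp1(xyz1, xyz2, f2))
        u0 = self.sift4(xyz,  self.fp2(xyz, xyz1, u1))
        return self.head(u0)                           # (B, N, num_classes)

logits = FrustumSegNet()(torch.rand(2, 1024, 3))
print(logits.shape)  # torch.Size([2, 1024, 2])
```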
After three-dimensional semantic segmentation, as shown in fig. 5, a probability score is predicted for each point, and the point cloud of the object of interest can be extracted and classified. After these segmented target points are obtained, their coordinates are further normalized to improve the translational invariance of the algorithm. F-PointNet converts the point cloud to local coordinates by subtracting the centroid's XYZ values and estimates the center of the real object with a lightweight T-Net. Finally, the F-PointNet network estimates an oriented three-dimensional bounding box of the object through the box regression PointNet together with the preceding coordinate-normalization networks. For a given object, the box estimation network takes the object point cloud in the three-dimensional object coordinate frame and outputs not the object class score but the parameters of the three-dimensional box. A three-dimensional bounding box is parameterized by its center (cx, cy, cz), size (h, w, l), etc.
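As a small illustration of the coordinate normalization and center regression described above, the sketch below subtracts the centroid from the segmented object points and lets a tiny PointNet-style network predict a residual offset to the box center. It is a simplified stand-in for the patent's T-Net, with illustrative layer widths.

```python
# Hedged sketch of T-Net-style centre regression on the segmented object points.
# Layer sizes are illustrative; the real T-Net configuration is not reproduced here.
import torch
import torch.nn as nn

class TNetCenter(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                 nn.Linear(128, 256), nn.ReLU())
        self.fc = nn.Linear(256, 3)                    # residual centre offset

    def forward(self, pts):                            # pts: (B, N, 3) object points
        centroid = pts.mean(dim=1, keepdim=True)       # (B, 1, 3)
        local = pts - centroid                         # translate to local coordinates
        feat = self.mlp(local).max(dim=1).values       # global max-pooled feature
        return centroid.squeeze(1) + self.fc(feat)     # predicted box centre (B, 3)

center = TNetCenter()(torch.rand(4, 512, 3))
print(center.shape)  # torch.Size([4, 3])
```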
And step 207, performing iterative training on the three-dimensional target detection network based on the three-dimensional detection result.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (8)

1. A Mars surface three-dimensional terrain detection method based on image and point cloud fusion is characterized by comprising the following steps:
acquiring image data of a region to be detected on the surface of the Mars;
taking the image data of the area to be detected as the input of the trained three-dimensional target detection network;
determining a terrain detection result of a region to be detected on the surface of the Mars according to the output of the three-dimensional target detection network;
visually outputting a terrain detection result;
the image data of the region to be detected on the surface of the Mars are obtained by driving a calibrated camera set to move on the surface of the Mars through a trolley robot, and the calibrating of the camera set comprises calibrating a depth camera by using a calibration plate and obtaining camera internal parameters and camera external parameters;
the camera set comprises a color camera and a depth camera, and the image data comprises an RGB color image and a depth image;
wherein the training of the three-dimensional target detection network comprises:
acquiring image data of a Mars simulation field, namely an RGB color image and a depth image;
converting the depth image of the Mars simulation field into a three-dimensional point cloud, and preprocessing the three-dimensional point cloud data;
labeling an RGB color image of a mars simulation field to obtain a color image data set, and labeling the preprocessed three-dimensional point cloud data to obtain a three-dimensional point cloud data set;
the color image data set is used as the input of a two-dimensional target detection network, the Mars simulation field is detected, and a two-dimensional detection result is output;
taking the two-dimensional detection result and the three-dimensional point cloud data set as input, and extracting conical point cloud of a target area in the three-dimensional point cloud;
using the conical point cloud as the input of a three-dimensional target detection network, detecting the Mars simulation field and outputting a three-dimensional detection result;
performing iterative training on the three-dimensional target detection network based on the three-dimensional detection result;
the image data of the surface of the Mars simulation field is obtained by driving a calibrated camera group to move in the Mars simulation field through a trolley robot;
the cone-shaped point cloud is used as the input of the three-dimensional target detection network, the detection of the Mars simulation field and the output of the three-dimensional detection result comprise:
extracting the characteristics of the conical point cloud by adopting an improved PointSIFT network, thereby performing three-dimensional semantic segmentation;
performing feature extraction and prediction on the segmented object point cloud by using a T-Net sub-network so as to obtain the spatial position of the terrain of the target area;
performing network classification and regression on the object point cloud obtained by segmentation by using a box regression subnetwork, thereby obtaining a three-dimensional boundary bounding box of the target area terrain;
outputting the space position and the three-dimensional boundary bounding box as a three-dimensional detection result;
the improved PointSIFT network comprises a first PointSIFT module, a first SA module, a second PointSIFT module, a second SA module, a first FP module, a third PointSIFT module, a second FP module, a fourth PointSIFT module and a full connection layer which are connected in sequence;
the first PointSIFT module, the second PointSIFT module, the third PointSIFT module and the fourth PointSIFT module respectively comprise a direction coding unit and a scale sensing unit, and the direction coding unit and the scale sensing unit are used for extracting features in different directions and different scales;
the first SA module and the second SA module are used for down-sampling, and the first FP module and the second FP module are used for up-sampling; and the full connection layer is used for receiving the processed characteristic information.
2. The Mars surface three-dimensional terrain detection method based on image and point cloud fusion as claimed in claim 1, wherein the acquisition of the image data comprises:
adjusting the acquisition angle and the illumination environment of the camera set; and controlling the trolley robot to move by using a ros system according to preset parameters and routes, recording the terrain environment by the camera group to obtain a bag file, transmitting and storing the bag file, and analyzing the bag file according to the timestamp to obtain an RGB (red, green and blue) color image and a depth image of each frame.
3. The Mars surface three-dimensional terrain detection method based on image and point cloud fusion of claim 1, characterized in that the depth image conversion into a three-dimensional point cloud comprises:
the camera intrinsic parameters are used as constraints of the coordinate transformation to convert the depth image into a three-dimensional point cloud, and the formula is as follows:
z = D
x = (x' - c_x) · D / f_x
y = (y' - c_y) · D / f_y
wherein x, y, z are the three-dimensional point cloud coordinates, x' and y' are the depth image pixel coordinates, f_x and f_y are the focal lengths and (c_x, c_y) is the principal point in the camera intrinsic parameters, and D is the depth value of the depth image.
4. The Mars surface three-dimensional terrain detection method based on image and point cloud fusion of claim 3, characterized in that the three-dimensional point cloud coordinates are saved in the pcd point cloud format and as binary bin files.
5. The Mars surface three-dimensional terrain detection method based on image and point cloud fusion as claimed in claim 1, wherein the taking of the color image data set as the input of the two-dimensional target detection network and outputting a two-dimensional detection result comprises:
inputting the color image data set into a two-dimensional target detection network Yolov5 to be trained to obtain weights, predicting and identifying a target area by using the trained weight file to obtain the category and the position of the target area in a two-dimensional image, and outputting the category and the position as a two-dimensional detection result.
6. The Mars surface three-dimensional terrain detection method based on image and point cloud fusion as claimed in claim 1, wherein the extracting of the cone-shaped point cloud of the target area in the three-dimensional point cloud by taking the two-dimensional detection result and the three-dimensional point cloud data set as input comprises:
converting a three-dimensional point cloud data set under a depth camera coordinate system into a two-dimensional image under a color camera coordinate system based on coordinate conversion of the depth camera and the color camera;
taking the position in the two-dimensional detection result as a target area in the two-dimensional image, and converting the two-dimensional image in the target area into a three-dimensional point cloud data set, namely a cone-shaped point cloud, in the target area based on the coordinate conversion of the depth camera and the color camera;
the formula for converting the coordinates of the depth camera and the color camera is as follows:
P_rgb = R · P_ir + T
wherein P_rgb and P_ir are the projected coordinate points under the color camera and the depth camera respectively, and R and T are the rotation matrix and the translation matrix of the camera extrinsic parameters.
7. The Mars surface three-dimensional terrain detection method based on image and point cloud fusion as claimed in claim 1,
the preprocessing of the three-dimensional point cloud data comprises format conversion and data normalization;
the marking of the RGB color image of the mars simulation field comprises marking operation by using label img marking software, wherein the marking comprises positions and types;
and the marking of the preprocessed three-dimensional point cloud data comprises marking operation by adopting a ros system, wherein the marking comprises a three-dimensional boundary bounding box and a spatial position.
8. The Mars surface three-dimensional terrain detection method based on image and point cloud fusion of claim 1, wherein the size of the RGB color image is 1920x1080 pixels with three RGB channels; the size of the depth image is 512x424 pixels, each pixel being 16 bits (2 bytes) and storing the depth value, i.e. the actual distance, in millimeters.
CN202110565199.2A 2021-05-24 2021-05-24 Mars surface three-dimensional terrain detection method based on image and point cloud fusion Active CN113298781B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110565199.2A CN113298781B (en) 2021-05-24 2021-05-24 Mars surface three-dimensional terrain detection method based on image and point cloud fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110565199.2A CN113298781B (en) 2021-05-24 2021-05-24 Mars surface three-dimensional terrain detection method based on image and point cloud fusion

Publications (2)

Publication Number Publication Date
CN113298781A CN113298781A (en) 2021-08-24
CN113298781B true CN113298781B (en) 2022-09-16

Family

ID=77324256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110565199.2A Active CN113298781B (en) 2021-05-24 2021-05-24 Mars surface three-dimensional terrain detection method based on image and point cloud fusion

Country Status (1)

Country Link
CN (1) CN113298781B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114078151B (en) * 2022-01-19 2022-04-22 季华实验室 Point cloud fusion method and device, electronic equipment and storage medium
CN117944059B (en) * 2024-03-27 2024-05-31 南京师范大学 Track planning method based on vision and radar feature fusion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523552A (en) * 2018-10-24 2019-03-26 青岛智能产业技术研究院 Three-dimension object detection method based on cone point cloud
CN110264572A (en) * 2019-06-21 2019-09-20 哈尔滨工业大学 A kind of terrain modeling method and system merging geometrical property and mechanical characteristic
CN112287939A (en) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 Three-dimensional point cloud semantic segmentation method, device, equipment and medium

Also Published As

Publication number Publication date
CN113298781A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN110175576B (en) Driving vehicle visual detection method combining laser point cloud data
CN110956651B (en) Terrain semantic perception method based on fusion of vision and vibrotactile sense
US10915793B2 (en) Method and system for converting point cloud data for use with 2D convolutional neural networks
US20180307921A1 (en) Image-Based Pedestrian Detection
CN110706248A (en) Visual perception mapping algorithm based on SLAM and mobile robot
CN117441197A (en) Laser radar point cloud dynamic segmentation and fusion method based on driving safety risk field
CN113298781B (en) Mars surface three-dimensional terrain detection method based on image and point cloud fusion
CN108711172B (en) Unmanned aerial vehicle identification and positioning method based on fine-grained classification
CN115187964A (en) Automatic driving decision-making method based on multi-sensor data fusion and SoC chip
CN114821507A (en) Multi-sensor fusion vehicle-road cooperative sensing method for automatic driving
CN114325634A (en) Method for extracting passable area in high-robustness field environment based on laser radar
CN117058646B (en) Complex road target detection method based on multi-mode fusion aerial view
CN113592905A (en) Monocular camera-based vehicle running track prediction method
CN106845458A (en) A kind of rapid transit label detection method of the learning machine that transfinited based on core
CN116597122A (en) Data labeling method, device, electronic equipment and storage medium
Liao et al. Lr-cnn: Local-aware region cnn for vehicle detection in aerial imagery
CN113538585B (en) High-precision multi-target intelligent identification, positioning and tracking method and system based on unmanned aerial vehicle
CN113219472B (en) Ranging system and method
Zhao et al. Improving autonomous vehicle visual perception by fusing human gaze and machine vision
CN117423077A (en) BEV perception model, construction method, device, equipment, vehicle and storage medium
Bruno et al. Computer vision system with 2d and 3d data fusion for detection of possible auxiliaries routes in stretches of interdicted roads
Hadzic et al. Rasternet: Modeling free-flow speed using lidar and overhead imagery
CN112395956A (en) Method and system for detecting passable area facing complex environment
CN113836975A (en) Binocular vision unmanned aerial vehicle obstacle avoidance method based on YOLOV3
Liu et al. A lightweight lidar-camera sensing method of obstacles detection and classification for autonomous rail rapid transit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant