CN112634367A - Anti-occlusion object pose estimation method based on deep neural network - Google Patents
- Publication number
- CN112634367A (application number CN202011562092.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention provides an anti-occlusion object pose estimation method based on a deep neural network, comprising the following steps. A labeled training-set picture database and a test-set picture database are constructed automatically with 3D modeling software. A deep neural network is built: it comprises four sub-branch networks, each an independent convolutional neural network. A processing algorithm for the network's predicted output is constructed: each of the four sub-branches outputs a 6-dimensional prediction at its final densely connected layer, representing the pose of the object to be estimated; because of occlusion, the result of some branch may be in error and the four outputs may contain abnormal values, so five algorithms are constructed to optimize these outliers and improve resistance to occlusion interference. The deep neural network model is then trained with the samples in the training set, and finally tested with test sets of different occlusion ratios.
Description
Technical Field
The invention belongs to the field of object pose estimation and relates to a method for estimating an object's pose with strong interference resistance using a deep neural network.
Background
The pose of an object covers all of its spatial information, including both position and orientation. In modern industrial production and in daily life, object pose information is of great significance and plays a pivotal role. Accurate estimation of object pose underlies many current industrial applications. In robotics, for example, accurately acquiring a target's position and orientation is a principal task of robot vision and the basis of subsequent operations such as grasping. In autonomous driving for the Internet of Things, accurate estimation of obstacle poses is a precondition for, and guarantee of, safe driving. Estimating object pose accurately and quickly is therefore of great importance.
Compared with other deep-learning architectures such as Deep Belief Networks (DBNs) and Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs) have significant advantages for processing image information and are therefore the usual choice for image-related applications. A convolution kernel extracts information by sliding over a feature map: shallow feature maps capture visual information such as texture and contours, while deep feature maps capture more abstract semantic information and integrate regional information across the image. This weight-sharing convolutional scheme greatly reduces the number of parameters in the network and markedly improves training and convergence speed. Convolutional neural networks have achieved good results in image classification, object detection, pattern recognition, and similar applications, and in recent years have been applied to object pose estimation as well.
The rise of computer vision and deep neural networks has greatly simplified object pose estimation, overcoming the complex equipment and procedures of traditional approaches. Yet deep neural networks have inherent shortcomings of their own. A network's accuracy depends on a large-scale training set and on the consistency between training and test samples; good estimation is achieved only when test samples closely resemble training samples. In real industrial production and applications, however, factors such as occlusion, noise, and illumination are ubiquitous, consistency between test and training samples cannot be guaranteed, and the performance of a trained model degrades markedly under occlusion interference.
In target detection with convolutional neural networks, occlusion is an important interference factor affecting detection performance [1]. When occlusion is present, training and test samples differ substantially, and the network cannot intelligently recognize and compensate for the difference, causing large errors and degraded performance. Occlusion generally falls into two types. The first is mutual occlusion between two target objects; this case has been studied extensively and many solutions have been proposed. The second, in which the target is occluded by an interfering object, is more common in industrial applications, yet at present it can only be mitigated by increasing the number and diversity of training samples; no effective solution exists.
A pose estimation method with anti-occlusion capability therefore has substantial industrial and application value. Convolutional neural networks, with their excellent data-representation capability [2], are a natural foundation for such a method.
The related documents are:
[1] Chu Xingxian, Zhao Hepeng, Qi Peng, et al. Research on detection and recognition technology for occluded targets [J]. Digital Technology and Application, 2013(9): 73-75.
[2] Liu Dong, Li Su, Cao Zhidong. A review of deep learning and its application in image object classification and detection [J]. Computer Science, 2016(12): 13-23.
Disclosure of Invention
Aiming at the problem of occlusion interference in object pose estimation, the invention provides an anti-occlusion pose estimation method based on a deep neural network. Using a monocular vision system, the method takes an image of the object to be estimated as input and outputs the object's pose information end to end. The deep convolutional neural network ensures fast, real-time estimation, while a processing algorithm applied to the network's predictions ensures accuracy and strong interference resistance. The technical scheme is as follows:
an anti-occlusion object pose estimation method based on a deep neural network comprises the following steps:
In the first step, labeled training-set and test-set picture databases are constructed automatically using 3D modeling software.
(1) Construct a regular cylindrical object as the target to be estimated, with a checkerboard icon as the marker;
(2) Place the target in front of the camera with the marker centered in the camera's field of view; the center of the target, the marker, and the center of the camera lens lie on the same horizontal line, which serves as the reference position;
(3) Move and rotate the object according to a script to vary its spatial position and orientation, capture a photo of the marker in each pose, and use the corresponding six-dimensional coordinates as the training sample's label;
(4) Obtain photos in batches as training samples and convert their labels into the data format required for network input;
(5) Construct occlusion test sets in the same manner, except that the markers are occluded at different percentages;
In the second step, the deep neural network is constructed. It comprises four sub-branch networks, each an independent convolutional neural network consisting of 6 convolution layers, 4 max-pooling layers, 1 flattening layer, and 3 densely connected layers. A max-pooling layer follows every one or two convolution layers; the numbers of convolution kernels are 32, 32, 64, 128, 256 in order, and the densely connected layers have 2048 and 6 units in order.
In the third step, a processing algorithm for the network's predicted output is constructed. Each of the four sub-branches outputs a 6-dimensional prediction at its final densely connected layer, representing the pose of the object to be estimated. Because of occlusion, the result of some branch may be in error and the four outputs may contain abnormal values; five algorithms are constructed to optimize these outliers and improve resistance to occlusion interference:
(1) Weighted-average method: the 4 branches each output a 6-dimensional prediction of the target pose; the predictions are combined dimension by dimension as a weighted average, which is taken as the final 6-dimensional output;
(2) Euclidean-distance method: the 6 dimensions are treated independently. For each dimension: compute, for every branch, its average Euclidean distance to the other 3 branches in that dimension; set a threshold on this average distance; compute the difference between each branch's distance and the overall average, and when the difference exceeds the threshold, judge that branch's prediction abnormal in the current dimension and delete it; average the predictions of the remaining branches as the output for that dimension. Repeating this over all 6 dimensions yields the final 6-dimensional prediction;
(3) Point-cluster density method: the 6 dimensions are treated independently. For each dimension: compute each branch's point-cluster density with respect to the other 3 branches, where density is inversely related to distance (the larger the distance, the smaller the density) and is represented by the reciprocal of the Euclidean distance; set a density threshold, and when a branch's density falls below the preset threshold, judge that branch's prediction abnormal in the current dimension and delete it; average the predictions of the remaining branches as the output for that dimension. Repeating this over all 6 dimensions yields the final 6-dimensional prediction;
(4) Joint Euclidean-distance method: the 4 branches each output 6-dimensional predictions, and abnormality in one dimension is correlated with the others, so the linkage among all 6 dimensions is considered jointly, with Euclidean distance as the criterion. First, for each dimension, compute each branch's average Euclidean distance to the other 3 branches. Second, for each branch, take a weighted average of its distances over the 6 dimensions to obtain a confidence for that branch. Third, rank the confidences of the 4 branches, find the branch with the lowest confidence, i.e., the largest weighted-average Euclidean distance, and exclude it. Fourth, combine the remaining 3 branches with the weighted-average method of algorithm (1) and output the 6-dimensional prediction;
(5) Joint point-cluster density method: as in (4), the linkage among the 6 dimensions is considered jointly, with point-cluster density as the criterion. First, for each dimension, compute each branch's point-cluster density with respect to the other 3 branches. Second, for each branch, take a weighted average of its densities over the 6 dimensions to obtain a confidence. Third, rank the confidences of the 4 branches, find the branch with the lowest confidence, i.e., the smallest weighted-average density, and exclude it. Fourth, combine the remaining 3 branches with the weighted-average method of algorithm (1) and output the 6-dimensional prediction;
In the fourth step, the deep neural network model is trained using the samples in the training set;
In the fifth step, the model is tested using test sets with different occlusion ratios.
Using a convolutional neural network together with 5 outlier-detection algorithms, the invention provides a strongly interference-resistant object pose estimation method based on a deep neural network. Photos of the unoccluded marker form the training set for the deep convolutional network model, and photos of the partially occluded marker form the test set for evaluating its interference resistance. Compared with the prior art, the method is markedly more robust, and pose estimation accuracy under occlusion is greatly improved.
Drawings
FIG. 1 is a flow chart of strong anti-interference object pose estimation based on a deep neural network
FIG. 2 Euclidean distance algorithm flow chart
FIG. 3 is a flow chart of a point group density algorithm
FIG. 4 is a flow chart of a joint Euclidean distance algorithm
FIG. 5 flow chart of the joint point group density algorithm
FIG. 6 comparison of effects under different test sets
FIG. 7 is a comparison graph of attitude estimation effects of 5 algorithms under different test sets
FIG. 8 is a comparison graph of position estimation effects of 5 algorithms under different test sets
Detailed Description
The invention adopts a convolutional-network framework to learn the mapping between extracted features and vision, and builds 5 different mathematical algorithms on top of the network's predicted output to process the predictions, improving prediction capability in the presence of occlusion interference. The scheme greatly simplifies object pose estimation, omits image-processing stages such as feature extraction and feature matching, and achieves end-to-end estimation. Compared with the prior art, processing the network's predictions with these output algorithms further improves pose estimation accuracy and occlusion resistance, making object pose estimation more convenient, fast, accurate, and efficient.
To make the technical scheme clearer, the invention is further explained below with reference to the drawings. The invention is realized in the following steps:
In the first step, labeled training-set and test-set picture databases are constructed automatically using 3D modeling software.
(1) Construct a regular cylinder of radius 100 mm and height 200 mm as the target to be estimated, with a checkerboard icon as the marker.
(2) Place the object 0.5 m in front of the camera, with the marker centered in the camera's field of view; the center of the cylinder, the marker, and the camera lens lie on the same horizontal line, which serves as the reference position.
(3) Move and rotate the object according to a script to vary its spatial position and orientation, capture a photo of the marker in each pose, and use the corresponding six-dimensional coordinates as the training sample's label.
(4) Obtain 50000 photos in batches as training samples and convert their labels into the data format required for network input.
(5) Construct occlusion test sets in the same manner, except that the markers are occluded at rates of 0%, 4%, 9%, 12%, 16%, 20%, and 25%.
In the second step, the deep neural network is constructed. It comprises four sub-branch networks, each an independent convolutional neural network consisting of 6 convolution layers, 4 max-pooling layers, 1 flattening layer, and 3 densely connected layers. A max-pooling layer follows every one or two convolution layers; the numbers of convolution kernels are 32, 32, 64, 128, 256 in order, and the densely connected layers have 2048 and 6 units in order.
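As a concreteness check, the per-branch layer stack can be traced shape by shape in plain Python. This is a sketch under stated assumptions, not the patent's implementation: the text lists only five kernel counts for six convolution layers, so a sixth value of 64 is assumed here, along with 'same'-padded convolutions, 2×2 pooling at assumed positions, and a hypothetical 128×128 input.

```python
def branch_shapes(h, w, in_ch=3):
    """Trace feature-map shapes through one sub-branch:
    6 'same' convolutions, 4 max-pooling layers, flatten, 3 dense layers.
    Kernel counts and pooling positions are assumptions, not patent text."""
    kernels = [32, 32, 64, 64, 128, 256]   # assumed: a 64 inserted as 4th value
    pool_after = {1, 3, 4, 5}              # assumed: pool after these conv layers
    shapes = []
    ch = in_ch
    for idx, k in enumerate(kernels):
        ch = k                             # 'same' padding keeps h and w
        if idx in pool_after:
            h, w = h // 2, w // 2          # 2x2 max pooling halves each side
        shapes.append((h, w, ch))
    flat = h * w * ch                      # flattening layer
    dense = [2048, 2048, 6]                # assumed widths; patent gives "2048, 6"
    return shapes, flat, dense

shapes, flat, dense = branch_shapes(128, 128)
print(shapes[-1], flat, dense[-1])         # final map, flattened size, 6-D pose
```

Under these assumptions a 128×128 input ends as an 8×8×256 feature map, flattened to 16384 values before the dense layers produce the 6-dimensional pose.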
In the third step, the processing algorithm for the network's predicted output is constructed. Each of the four sub-branches outputs a 6-dimensional estimate at its final densely connected layer, representing the pose of the object to be estimated. Because of occlusion, the result of some branch may be in error and the 4 outputs may contain abnormal values; 5 algorithms are constructed to optimize these results and improve interference resistance.
Notation: N = 4 is the number of branches; n ∈ {1, 2, 3, 4} indexes a specific branch; i ∈ {1, …, 6} indexes the 6 degrees of freedom; and P(i) denotes the predicted value of the i-th degree of freedom. The weighted-average, Euclidean-distance, and point-cluster density methods analyze each dimension's prediction independently, while the joint Euclidean-distance and joint point-cluster density methods analyze the 6-dimensional prediction jointly.
Weighted-average method: the 4 branches each output a 6-dimensional prediction of the target pose, so the network output has dimensions 4 × 6; the predictions are combined dimension by dimension as a weighted average, which is taken as the final 6-dimensional output. When the region corresponding to some branch is occluded, that branch's prediction has a large error; averaging it with the accurately predicting branches weakens and reduces that error.
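The weighted-average rule is simple enough to state in a few lines of NumPy. The function name, uniform default weights, and the (4, 6) array layout are illustrative choices, not from the patent:

```python
import numpy as np

def weighted_average(preds, weights=None):
    """Combine branch outputs dimension by dimension.
    preds: (4, 6) array, one 6-D pose prediction per branch.
    weights: per-branch weighting factors; uniform if omitted."""
    preds = np.asarray(preds, dtype=float)
    if weights is None:
        weights = np.full(preds.shape[0], 1.0 / preds.shape[0])
    weights = np.asarray(weights, dtype=float)
    return weights @ preds          # per-dimension weighted sum over branches
```

With uniform weights this reduces to the plain mean; a single occluded branch still shifts the result, which is what motivates the outlier-rejection methods (2) through (5).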
Equation (1) gives the average of the i-th degree of freedom, and equation (2) its weighted average, where s(i) is a weighting factor.
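Equations (1) and (2) were not reproduced in this text; from the surrounding description (N branches, predicted values P_n(i), weighting factors s(i)), a plausible reconstruction is:

```latex
% (1) plain average of the i-th degree of freedom over the N = 4 branches
\bar{P}(i) = \frac{1}{N}\sum_{n=1}^{N} P_n(i)

% (2) weighted average with weighting factors s_n(i), \sum_{n} s_n(i) = 1
P(i) = \sum_{n=1}^{N} s_n(i)\, P_n(i)
```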
Euclidean-distance method: the 6 dimensions are treated independently. For each dimension, compute each branch's average Euclidean distance to the other 3 branches; set a threshold on this average distance; compute the difference between each branch's distance and the overall average, and when the difference exceeds the threshold, judge that branch's prediction abnormal in the current dimension and delete it; average the remaining branches' predictions as the output for that dimension. Repeating this over all 6 dimensions yields the final 6-dimensional prediction. The algorithm flow is shown in fig. 2. Equation (3) computes the distance dis(n, i) of the n-th branch from the other N-1 branches in dimension i, and equation (4) the average distance dis(i) over the 4 branches. The Euclidean-distance threshold is set to 0.2; when the gap between a branch's distance dis(n, i) and the average distance dis(i) exceeds the threshold, that estimate is judged an abnormal value and excluded.
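A minimal NumPy sketch of this per-dimension rejection, assuming scalar per-dimension values (so the Euclidean distance reduces to an absolute difference) and the 0.2 threshold quoted above; function and variable names are illustrative:

```python
import numpy as np

def euclidean_filter(preds, threshold=0.2):
    """Per-dimension outlier rejection over 4 branches x 6 dimensions."""
    preds = np.asarray(preds, dtype=float)
    n_branch, n_dim = preds.shape
    out = np.empty(n_dim)
    for i in range(n_dim):
        col = preds[:, i]
        # dis(n, i): mean distance of branch n to the other branches, eq. (3)
        dis = np.array([np.abs(col[n] - np.delete(col, n)).mean()
                        for n in range(n_branch)])
        # eq. (4): average distance over all branches; drop branches whose
        # distance exceeds the average by more than the threshold
        keep = (dis - dis.mean()) <= threshold
        if not keep.any():
            keep[:] = True               # fallback: never delete every branch
        out[i] = col[keep].mean()
    return out
```

For example, three branches near 1.0 and one occluded branch at 5.0 yield an output near 1.0 rather than the contaminated mean of 2.0.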
Point-cluster density method: the 6 dimensions are treated independently. For each dimension, compute each branch's point-cluster density with respect to the other 3 branches; density is inversely related to distance (the larger the distance, the smaller the density) and is represented by the reciprocal of the Euclidean distance. Set a density threshold; when a branch's density falls below the average density by more than the threshold, judge that branch's prediction abnormal in the current dimension and delete it; average the remaining branches' predictions as the output for that dimension. Repeating this over all 6 dimensions yields the final 6-dimensional prediction.
The algorithm flow is shown in fig. 3. Geometrically, an outlier has a small point-cluster density, so outliers are excluded by comparing densities in each dimension. Equation (5) computes the point-cluster density of the n-th branch, and equation (6) finds the branch with the minimum density.
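The same example can be run through a density-based sketch. The reciprocal-distance density follows the text; the numeric threshold (0.3) and the below-average test are assumptions, since the patent does not state a density threshold value:

```python
import numpy as np

def density_filter(preds, threshold=0.3, eps=1e-12):
    """Per-dimension rejection using point-cluster density (inverse distance)."""
    preds = np.asarray(preds, dtype=float)
    n_branch, n_dim = preds.shape
    out = np.empty(n_dim)
    for i in range(n_dim):
        col = preds[:, i]
        dis = np.array([np.abs(col[n] - np.delete(col, n)).mean()
                        for n in range(n_branch)])
        dens = 1.0 / (dis + eps)         # eq. (5): density as inverse distance
        # assumed rule: a branch whose density falls below the average
        # by more than `threshold` is judged abnormal and deleted
        keep = (dens.mean() - dens) <= threshold
        if not keep.any():
            keep[:] = True               # fallback: never delete every branch
        out[i] = col[keep].mean()
    return out
```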
Since the 6 degrees of freedom of an object are correlated, their mutual linkage can be considered jointly: the distance between the j-th and k-th branches, taken as a whole over all 6 dimensions, is expressed by B(j, k) in equation (7).
Joint Euclidean-distance method: the 4 branches each output 6-dimensional predictions, and abnormality in one dimension is correlated with the others, so the linkage among all 6 dimensions is considered jointly, with Euclidean distance as the criterion. First, for each dimension, compute each branch's average Euclidean distance to the other 3 branches. Second, for each branch, take a weighted average of its distances over the 6 dimensions to obtain a confidence for that branch. Third, rank the confidences of the 4 branches, find the branch with the lowest confidence, i.e., the largest weighted-average Euclidean distance, and exclude it. Fourth, combine the remaining 3 branches with the weighted-average method of algorithm (1) and output the 6-dimensional prediction. The algorithm flow chart is shown in fig. 4. Following the Euclidean-distance algorithm, an abnormal value is determined with the 6-dimensional joint Euclidean distance as the criterion. Equation (8) gives the Euclidean distance of the n-th branch in dimension i; equation (9) weights the dimensions to obtain the joint Euclidean distance of the n-th branch; and equation (10) selects the abnormal branch with the largest joint Euclidean distance.
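The four steps of the joint method can be sketched directly; dimension weights default to uniform, and all names are illustrative:

```python
import numpy as np

def joint_euclidean(preds, dim_weights=None):
    """Exclude the one least-confident branch using the 6-D joint distance,
    then average the remaining branches (algorithm (1), uniform weights)."""
    preds = np.asarray(preds, dtype=float)
    n_branch, n_dim = preds.shape
    if dim_weights is None:
        dim_weights = np.full(n_dim, 1.0 / n_dim)
    # eq. (8): per-dimension mean distance of each branch to the others
    dis = np.array([[np.abs(preds[n, i] - np.delete(preds[:, i], n)).mean()
                     for i in range(n_dim)] for n in range(n_branch)])
    joint = dis @ dim_weights            # eq. (9): weighted joint distance
    worst = int(np.argmax(joint))        # eq. (10): lowest-confidence branch
    keep = np.delete(np.arange(n_branch), worst)
    return preds[keep].mean(axis=0)
```

Unlike the per-dimension methods, exactly one branch is excluded, so a branch that is slightly off in every dimension is caught even if no single dimension crosses a threshold.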
Joint point-cluster density method: the 4 branches each output 6-dimensional predictions, and abnormality in one dimension is correlated with the others, so the linkage among all 6 dimensions is considered jointly, with point-cluster density as the criterion. First, for each dimension, compute each branch's point-cluster density with respect to the other 3 branches. Second, for each branch, take a weighted average of its densities over the 6 dimensions to obtain a confidence. Third, rank the confidences of the 4 branches, find the branch with the lowest confidence, i.e., the smallest weighted-average density, and exclude it. Fourth, combine the remaining 3 branches with the weighted-average method of algorithm (1) and output the 6-dimensional prediction. The algorithm flow chart is shown in fig. 5. The mutual linkage among the 6 dimensions is considered jointly, with the joint point-cluster density as the criterion for judging abnormal values. Equation (11) computes the point-cluster density of the n-th branch in dimension i; equation (12) the 6-dimension weighted joint density of the n-th branch; and equation (13) selects the abnormal branch with the minimum joint density.
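The joint density variant differs from the joint Euclidean-distance sketch only in the confidence score (inverse distance, with the minimum excluded); again a sketch with illustrative names:

```python
import numpy as np

def joint_density(preds, dim_weights=None, eps=1e-12):
    """Exclude the branch with the smallest joint point-cluster density,
    then average the remaining branches."""
    preds = np.asarray(preds, dtype=float)
    n_branch, n_dim = preds.shape
    if dim_weights is None:
        dim_weights = np.full(n_dim, 1.0 / n_dim)
    dis = np.array([[np.abs(preds[n, i] - np.delete(preds[:, i], n)).mean()
                     for i in range(n_dim)] for n in range(n_branch)])
    dens = 1.0 / (dis + eps)             # eq. (11): per-dimension density
    joint = dens @ dim_weights           # eq. (12): weighted joint density
    worst = int(np.argmin(joint))        # eq. (13): minimum-density branch
    keep = np.delete(np.arange(n_branch), worst)
    return preds[keep].mean(axis=0)
```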
In the fourth step, the deep neural network model is trained using the samples in the training set. The specific training parameters are as follows:
(1) Each epoch randomly selects 3000 pictures from the training set as that round's training samples;
(2) mini_batch = 2: 2 pictures are input to the network at a time, and back-propagation aims to reduce the loss on these two pictures;
(3) nb_epoch = 6: each selection of pictures is iterated over 6 times before proceeding to the next round;
(4) the currently trained network model is saved as an .h5 file (every 600 steps), and the next round continues training from the current model, i.e., another 3000 training pictures are drawn;
(5) stochastic gradient descent with back-propagation reduces the loss.
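The stated schedule can be sanity-checked with a small pure-Python sketch. The "save every 600 steps" reading of point (4) is an assumption (the original wording is garbled), as is the number of outer rounds:

```python
import random

def training_schedule(train_size=50000, per_round=3000, mini_batch=2,
                      nb_epoch=6, save_every=600, rounds=1, seed=0):
    """Count the SGD steps and .h5 checkpoint saves the schedule implies."""
    rng = random.Random(seed)
    steps = saves = 0
    for _ in range(rounds):                              # outer training rounds
        picks = rng.sample(range(train_size), per_round) # 3000 random pictures
        for _ in range(nb_epoch):                        # repeated 6 times
            for start in range(0, len(picks), mini_batch):
                steps += 1                               # one backprop step
                if steps % save_every == 0:
                    saves += 1                           # checkpoint (.h5)
    return steps, saves
```

Under these assumptions one round amounts to 6 × 3000 / 2 = 9000 mini-batch steps and 15 checkpoint saves.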
The fourth step: test the deep neural network model on test sets with different occlusion ratios, and process the network's predicted output values with the 5 algorithms. Fig. 6 compares the pose estimation performance, under different occlusion ratios, of a single network SBN without any prediction processing algorithm and the 4-branch network MBN-4 using the joint Euclidean distance algorithm; MBN-4 maintains high estimation accuracy even as the occlusion ratio increases. Fig. 7 compares the average performance of the 5 algorithms on object pose across all test sets. Fig. 8 compares the average performance of the 5 algorithms on object position across all test sets.
Claims (1)
1. An anti-occlusion object pose estimation method based on a deep neural network, comprising the following steps:
firstly, a labeled training-set picture database and a labeled test-set picture database are automatically constructed using 3D modeling software:
(1) constructing a regular cylindrical object as the target to be estimated, with a checkerboard icon as the marker;
(2) placing the target to be estimated in front of the camera, with the marker in the middle of the camera view; the center of the target to be estimated, the marker, and the center of the camera lens lie on the same horizontal center line, which serves as the reference position;
(3) moving and rotating the object to be estimated according to a script to change its spatial position and posture, capturing marker photos in the corresponding postures, and taking the corresponding six-dimensional coordinates as the labels of the training samples;
(4) obtaining photos in batches as training-set samples, and converting their labels into the required data format to meet the network input requirements;
(5) constructing occlusion test sets in the same manner, except that the markers are occluded at rates of 0%, 4%, 9%, 12%, 16%, 20%, and 25%;
secondly, constructing the deep neural network: the network comprises four sub-branch networks, each an independent convolutional neural network consisting of 6 convolutional layers, 4 max-pooling layers, 1 flatten layer, and 3 densely connected layers; every one or two convolutional layers are followed by one max-pooling layer; the numbers of convolution kernels are 32, 32, 64, 128, 256 in order, and the densely connected layers have 2048 and 6 units in order;
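A quick shape trace through one such sub-branch helps make the layer counts concrete. The text lists five kernel counts for six convolutional layers and two sizes for three dense layers, so the sketch below fills the gaps with assumed repeats (32, 32, 64, 64, 128, 256 and 2048, 2048, 6) and assumes 3x3 'same'-padded convolutions with 2x2 pooling and a square RGB input; all of these are assumptions, not claim text.

```python
def branch_flatten_size(h=96, w=96):
    """Trace the feature-map shape through one sub-branch CNN.

    Assumed layout: four blocks of (n_convs, filters), each block ending
    in a 2x2 max-pool, so only pooling shrinks the spatial size; the
    flatten output then feeds dense layers of (assumed) 2048, 2048, 6.
    """
    blocks = [(2, 32), (2, 64), (1, 128), (1, 256)]  # 6 convs + 4 pools
    channels = 3                                     # RGB input (assumption)
    for n_convs, filters in blocks:
        channels = filters          # convolutions change the channel count only
        h, w = h // 2, w // 2       # each max-pool halves height and width
    return h * w * channels         # size fed into the flatten layer
```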
then, constructing the network prediction output processing algorithms: each of the four sub-branch networks outputs a 6-dimensional predicted value at its last densely connected layer, representing the pose information of the object to be estimated; because of occlusion, the result of some branch may be erroneous, so abnormal values exist among the outputs of the 4 branches; 5 algorithms are constructed to handle these abnormal values and improve resistance to occlusion interference;
(1) weighted average method: the 4 branches each output a 6-dimensional predicted value of the target to be estimated; the predictions are combined dimension by dimension with a weighted average, and the resulting 6-dimensional value is taken as the final output;
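For a 4x6 prediction matrix this baseline is a single NumPy call. A minimal sketch, assuming equal branch weights when none are supplied; the function name is illustrative.

```python
import numpy as np

def weighted_average_fuse(preds, weights=None):
    """Fuse the four 6-D branch outputs with a per-branch weighted average."""
    preds = np.asarray(preds, dtype=float)          # shape (4, 6)
    if weights is None:
        weights = np.ones(len(preds)) / len(preds)  # equal weights (assumption)
    return np.average(preds, axis=0, weights=weights)
```

Algorithms (4) and (5) below reuse this step on the three branches that survive outlier rejection.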
(2) Euclidean distance method: the estimates of the 6 dimensions are considered separately; for each dimension: calculate the average Euclidean distance between each branch and the other 3 branches in that dimension; set an average-Euclidean-distance threshold; compute the difference between each branch's distance and the average distance; when the difference exceeds the threshold, the branch's predicted value in the current dimension is judged abnormal and the branch is removed; average the predicted values of the remaining branches as the predicted output for that dimension; perform this operation for each of the 6 dimensions, and finally output the 6-dimensional predicted values;
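A per-dimension sketch of this rejection rule. Because each dimension is handled separately, the pairwise Euclidean distances reduce to absolute differences; the threshold value is illustrative, not from the claim.

```python
import numpy as np

def euclidean_outlier_fuse(preds, threshold=1.0):
    """Per-dimension fusion: drop branches whose mean distance to the other
    branches exceeds the average by more than `threshold`, then average."""
    preds = np.asarray(preds, dtype=float)           # shape (4, 6)
    n, d = preds.shape
    out = np.empty(d)
    for i in range(d):
        col = preds[:, i]
        # mean 1-D distance from each branch to the other 3 in this dimension
        dist = np.abs(col[:, None] - col[None, :]).sum(axis=1) / (n - 1)
        keep = (dist - dist.mean()) <= threshold     # drop far-above-average branches
        out[i] = col[keep].mean()
    return out
```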
(3) point group density method: the estimates of the 6 dimensions are considered separately; for each dimension: calculate the point group density between each branch and the other 3 branches in that dimension; the density is related to distance (the larger the distance, the smaller the density) and is represented by the reciprocal of the Euclidean distance; set a point group density threshold and compute the difference between each branch's density and the average density; when a branch's difference falls below the preset threshold, its predicted value in the current dimension is judged abnormal and the branch is removed; average the predicted values of the remaining branches as the predicted output for that dimension, perform this operation for each of the 6 dimensions, and finally output the 6-dimensional predicted values;
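The same per-dimension loop with densities, i.e. reciprocal distances, as the criterion. A sketch: the epsilon and the threshold value are illustrative choices, not part of the claim.

```python
import numpy as np

def density_outlier_fuse(preds, threshold=0.3):
    """Per-dimension fusion: drop branches whose point-group density falls
    more than `threshold` below the average density, then average."""
    preds = np.asarray(preds, dtype=float)           # shape (4, 6)
    n, d = preds.shape
    out = np.empty(d)
    for i in range(d):
        col = preds[:, i]
        dist = np.abs(col[:, None] - col[None, :]).sum(axis=1) / (n - 1)
        dens = 1.0 / (dist + 1e-9)                   # density = reciprocal distance
        keep = (dens - dens.mean()) >= -threshold    # drop far-below-average density
        out[i] = col[keep].mean()
    return out
```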
(4) joint Euclidean distance method: each of the 4 branches outputs a 6-dimensional predicted value, and an abnormality in one dimension is correlated with the others, so the mutual linkage among the 6 dimensions is considered jointly, with Euclidean distance as the judgment criterion; first, for each dimension: calculate the average Euclidean distance between each branch and the other 3 branches in that dimension; second, for each branch: take a weighted average of the 6 per-dimension average Euclidean distances to obtain the branch's confidence; third, sort the confidences of the 4 branches, find the branch with the lowest confidence (i.e., the branch with the largest weighted-average Euclidean distance from the second step), and exclude it; fourth, fuse the remaining 3 branches with the weighted average method of algorithm (1) and output the final 6-dimensional predicted values;
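A sketch of this joint variant (illustrative function name; equal dimension weights assumed). No threshold is needed here, since the single least-consistent branch is always excluded.

```python
import numpy as np

def joint_euclidean_fuse(preds, weights=None):
    """Fuse 4 branch outputs (shape (4, 6)) by dropping the branch whose
    weighted-average distance to the other branches is largest."""
    preds = np.asarray(preds, dtype=float)
    n_branches, n_dims = preds.shape
    if weights is None:
        weights = np.ones(n_dims) / n_dims      # equal dimension weights (assumption)
    mean_dist = np.empty((n_branches, n_dims))
    for k in range(n_branches):
        others = np.delete(preds, k, axis=0)    # the other 3 branches
        mean_dist[k] = np.abs(others - preds[k]).mean(axis=0)
    score = mean_dist @ weights                 # large score = low confidence
    worst = int(np.argmax(score))               # least consistent branch
    return np.delete(preds, worst, axis=0).mean(axis=0), worst
```

The joint point group density method (5) is the same procedure with reciprocal distances and `argmin`.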
(5) joint point group density method: each of the 4 branches outputs a 6-dimensional predicted value, and an abnormality in one dimension is correlated with the others, so the mutual linkage among the 6 dimensions is considered jointly, with point group density as the judgment criterion; first, for each dimension: calculate the point group density between each branch and the other 3 branches in that dimension; second, for each branch: take a weighted average of the 6 per-dimension density values to obtain the branch's confidence; third, sort the confidences of the 4 branches, find the branch with the lowest confidence (i.e., the branch with the smallest weighted-average point group density from the second step), and exclude it; fourth, fuse the remaining 3 branches with the weighted average method of algorithm (1) and output the final 6-dimensional predicted values;
thirdly, training the deep neural network model: the training of the deep neural network is completed using the samples in the training set;
fourthly, testing the deep neural network model on test sets with different occlusion ratios.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011562092.4A CN112634367A (en) | 2020-12-25 | 2020-12-25 | Anti-occlusion object pose estimation method based on deep neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112634367A true CN112634367A (en) | 2021-04-09 |
Family
ID=75324868
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011562092.4A Pending CN112634367A (en) | 2020-12-25 | 2020-12-25 | Anti-occlusion object pose estimation method based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112634367A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114627359A (en) * | 2020-12-08 | 2022-06-14 | 山东新松工业软件研究院股份有限公司 | Out-of-order stacked workpiece grabbing priority evaluation method |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108491880A (en) * | 2018-03-23 | 2018-09-04 | 西安电子科技大学 | Object classification based on neural network and position and orientation estimation method |
US20190147234A1 (en) * | 2017-11-15 | 2019-05-16 | Qualcomm Technologies, Inc. | Learning disentangled invariant representations for one shot instance recognition |
CN109816725A (en) * | 2019-01-17 | 2019-05-28 | 哈工大机器人(合肥)国际创新研究院 | A kind of monocular camera object pose estimation method and device based on deep learning |
WO2019144575A1 (en) * | 2018-01-24 | 2019-08-01 | 中山大学 | Fast pedestrian detection method and device |
CN110322510A (en) * | 2019-06-27 | 2019-10-11 | 电子科技大学 | A kind of 6D position and orientation estimation method using profile information |
CN111339903A (en) * | 2020-02-21 | 2020-06-26 | 河北工业大学 | Multi-person human body posture estimation method |
Non-Patent Citations (2)
Title |
---|
YANG Jiachen et al.: "Robust Six Degrees of Freedom Estimation for IIoT Based on Multibranch Network", IEEE, 23 March 2020 (2020-03-23) * |
LEI Yutian; YANG Jiachen; MAN Jiabao; XI Meng: "Adaptive spacecraft situation analysis ***", Astronautical Systems Engineering Technology, no. 01, 15 January 2020 (2020-01-15) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||