CN113129370B - Semi-supervised object pose estimation method combining generated data and label-free data - Google Patents

Semi-supervised object pose estimation method combining generated data and label-free data

Info

Publication number
CN113129370B
CN113129370B CN202110241227.5A CN202110241227A
Authority
CN
China
Prior art keywords
data
training
point cloud
pose
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110241227.5A
Other languages
Chinese (zh)
Other versions
CN113129370A (en)
Inventor
陈启军
周光亮
颜熠
王德明
刘成菊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202110241227.5A priority Critical patent/CN113129370B/en
Publication of CN113129370A publication Critical patent/CN113129370A/en
Application granted granted Critical
Publication of CN113129370B publication Critical patent/CN113129370B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a semi-supervised object pose estimation method combining generated data and label-free data, which comprises the following steps: 1) generating point cloud data with pose labels, i.e. the generated data; 2) acquiring a color image and a depth image of a target object without labels, inputting the color image into a trained instance segmentation network to obtain an instance segmentation result, and obtaining the point cloud of the target object, i.e. the label-free real data, from the depth image according to the segmentation result; 3) in each training period, performing supervised training on the pose estimation network model with the generated data and self-supervised training on the pose estimation network model with the unlabeled real data; 4) after each training period is finished, calculating the accuracy of the pose estimation network model with part of the real data. Compared with the prior art, the method mainly solves the problem that 6D pose labels are difficult to acquire, and can accurately estimate the pose of an object using only synthetic data and unlabeled real data.

Description

Semi-supervised object pose estimation method combining generated data and label-free data
Technical Field
The invention relates to the field of robot vision, in particular to a semi-supervised object pose estimation method combining generated data and label-free data.
Background
Object pose estimation based on computer vision is a key technology for robotic grasping and dexterous manipulation. It is of great significance for improving a robot's adaptability to environments and tasks, widening the application fields of robots, and improving their flexibility and efficiency in scenarios such as intelligent manufacturing, warehouse logistics and home service. In addition, the technology has broad application prospects in fields such as autonomous driving, augmented reality and virtual reality.
In recent years, with the rapid development of deep learning, object pose estimation based on deep learning has achieved good results. In unstructured scenes with background clutter, stacked and occluded objects, and illumination changes, deep learning methods outperform traditional pose estimation methods in robustness, accuracy and real-time performance. However, deep learning is data-driven and requires a large amount of labeled training data to achieve satisfactory results, while in the field of object pose estimation 6D labels are difficult, time-consuming and labor-intensive to obtain.
To address the data acquisition problem, two main approaches exist at present. The first is to synthesize data artificially from a CAD model of the object. However, there is a domain gap between directly synthesized data and real data, so a model trained on synthetic data performs poorly in real scenes. To eliminate the domain gap, methods such as domain randomization, domain adaptation and photo-realistic image generation have been developed; although they achieve some improvement, they still do not reach the performance of models trained on real data. The second category consists of methods based on self-supervised and semi-supervised learning. Self-supervised and semi-supervised learning have been research hotspots in recent years and have been studied extensively in fields such as image classification and human pose estimation, but only a few attempts exist in object pose estimation. The existing method renders corresponding mask, color and depth images from the object model according to the pose predicted by the network, and uses visual and geometric alignment between these renderings and the actual input as the supervisory target, thereby realizing self-supervised training of the network. Although pose labels are not needed, this method still relies on generated color images for supervision, so the influence of the domain gap is not eliminated and its accuracy cannot meet the requirements of practical applications.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a semi-supervised object pose estimation method combining generated data and label-free data.
The purpose of the invention can be realized by the following technical scheme:
a semi-supervised object pose estimation method combining generated data and label-free data comprises the following steps:
1) generating point cloud data with pose labels from a CAD model of the target object, i.e. the generated data;
2) acquiring a color image and a depth image of the target object without labels, inputting the color image into a trained instance segmentation network to obtain an instance segmentation result, and obtaining the point cloud of the target object, i.e. the label-free real data, from the depth image according to the segmentation result;
3) in each training period, performing supervised training on the pose estimation network model with the generated data, and performing self-supervised training on the pose estimation network model with the unlabeled real data;
4) after each training period is finished, calculating the accuracy of the pose estimation network model with part of the real data, selecting the final pose estimation network model according to the accuracy, and estimating the object pose with the final pose estimation network model.
In step 3), within each training period, supervised training is performed first, followed by self-supervised training.
During supervised training of the pose estimation network model with the generated data, the input point cloud is transformed according to the pose predicted by the pose estimation network model and according to the real pose in the pose label, respectively, and the average distance between the two transformed point clouds is calculated and used as the loss function of the supervised training.
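The following is a minimal Python/PyTorch sketch of such a supervised loss (an ADD-style average distance); the function and variable names are illustrative assumptions rather than the patent's implementation, and the batch dimension is omitted for clarity.

```python
import torch

def supervised_add_loss(points, R_pred, t_pred, R_gt, t_gt):
    """Average distance between the point cloud transformed by the predicted
    pose and the same cloud transformed by the ground-truth pose.

    points: (N, 3) point cloud to transform
    R_pred, R_gt: (3, 3) rotation matrices; t_pred, t_gt: (3,) translations
    """
    pred = points @ R_pred.T + t_pred   # points under the predicted pose
    gt = points @ R_gt.T + t_gt         # points under the ground-truth pose
    return (pred - gt).norm(dim=1).mean()
```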
During self-supervised training of the pose estimation network model with the label-free real data, the model point cloud is transformed according to the pose predicted by the pose estimation network model, and the average distance between the transformed model point cloud and the actually input real data is calculated to form the loss function of the self-supervised training.
The average distance between the transformed model point cloud and the actually input real data is calculated as follows:
for each point in the actually input real point cloud, the spatially closest point in the transformed model point cloud is found, forming a closest-point set; the average distance between the actual point cloud and the corresponding points in the closest-point set is then calculated, which is the average distance between the transformed model point cloud and the actually input real data.
The expression of the loss function L of the self-supervised training is:

$$L = \frac{1}{N} \sum_{i=1}^{N} \min_{j \in \{1,\dots,M\}} \left\| P_i - \left(\hat{R} Q_j + \hat{t}\right) \right\|$$

where $P_i$ is the i-th point in the actually input point cloud, $Q_j$ is the j-th point in the object model point cloud, $\hat{R}$ and $\hat{t}$ are the rotation component and translation component predicted by the pose estimation network model, i.e. the predicted pose, N is the number of points in the input point cloud, and M is the number of points in the model point cloud.
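Below is a hedged Python/PyTorch sketch of this closest-point average distance; `torch.cdist` computes the pairwise distances, and all names are illustrative, not the patent's code.

```python
import torch

def self_supervised_loss(input_points, model_points, R_pred, t_pred):
    """Average distance from each observed point to its nearest neighbour in
    the model point cloud transformed by the predicted pose.

    input_points: (N, 3) segmented real object point cloud (no pose label)
    model_points: (M, 3) points sampled from the object CAD model
    R_pred: (3, 3) predicted rotation; t_pred: (3,) predicted translation
    """
    transformed = model_points @ R_pred.T + t_pred   # (M, 3) transformed model points
    dists = torch.cdist(input_points, transformed)   # (N, M) pairwise distances
    nearest = dists.min(dim=1).values                # distances to the closest-point set
    return nearest.mean()                            # average distance, i.e. loss L
```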
During self-supervised training of the pose estimation network model with label-free real data, a random homogeneous transformation is applied to the input real data to obtain a new point cloud; the two point clouds before and after the random homogeneous transformation are then input separately into the pose estimation network model, two self-supervised loss functions are calculated from the respective predicted poses, and the network is trained with both losses together.
In the self-supervised training, the expression of the complete self-supervised loss function is:

$$L_{\mathrm{self}} = L_1 + L_2$$

where the subscripts 1 and 2 correspond to the two point clouds before and after the random homogeneous transformation, respectively.
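The sketch below illustrates this point cloud transformation strategy. It assumes a `pose_net` that maps a point cloud to a predicted rotation and translation and reuses the `self_supervised_loss` sketch above; the random-transform magnitudes are assumed values not specified by the patent.

```python
import torch

def random_homogeneous_transform(points, trans_scale=0.05):
    """Apply a random rigid (homogeneous) transform to a point cloud;
    trans_scale is an assumed translation magnitude."""
    q, _ = torch.linalg.qr(torch.randn(3, 3))
    R = q * torch.sign(torch.det(q))        # force a proper rotation (det = +1)
    t = torch.randn(3) * trans_scale
    return points @ R.T + t

def dual_self_supervised_loss(pose_net, input_points, model_points):
    """Predict poses for the original and the randomly transformed cloud and
    sum the two self-supervised losses, training the network with both."""
    transformed_points = random_homogeneous_transform(input_points)
    R1, t1 = pose_net(input_points)          # predicted pose for cloud 1
    R2, t2 = pose_net(transformed_points)    # predicted pose for cloud 2
    loss1 = self_supervised_loss(input_points, model_points, R1, t1)
    loss2 = self_supervised_loss(transformed_points, model_points, R2, t2)
    return loss1 + loss2
```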
In step 4), after each training period is finished, the average distance between the actual point cloud and the transformed model point cloud is calculated on a test set and used as the evaluation index of the accuracy of the pose estimation network model; the smaller the average distance, the more accurate the model.
After the first training period, the calculated average distance is taken as the optimal distance. In each following training period, if the calculated average distance is smaller than the optimal distance, the optimal distance is updated; if the average distance is larger than the optimal distance and the difference exceeds a set threshold, the model trained in that period is discarded, and the next period continues training from the model of the previous period.
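A small sketch of this model-selection rule follows; the default threshold value and the way network states are passed around are assumptions, since the patent only requires "a set threshold".

```python
def keep_or_revert(avg_dist, best_dist, period_state, prev_state, threshold=0.01):
    """Decide what to do with the model after one training period.

    avg_dist:     average point distance on the test set for this period
    best_dist:    smallest average distance seen so far (the optimal distance)
    period_state: network weights at the end of this period
    prev_state:   network weights at the end of the previous period
    Returns (updated optimal distance, weights to continue training from).
    """
    if avg_dist < best_dist:
        return avg_dist, period_state        # better model: update the optimum
    if avg_dist - best_dist > threshold:
        return best_dist, prev_state         # much worse: discard this period's model
    return best_dist, period_state           # slightly worse: keep training from it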
Compared with the prior art, the invention has the following advantages:
the semi-supervised training method using the generated data and the unmarked real data solves the problem that the existing object pose estimation method based on deep learning depends on large-scale marked real data, and greatly improves the flexibility of pose estimation application.
In the process of self-supervision training, the invention adopts a point cloud transformation strategy and calculates two self-supervision losses to train the network simultaneously, thereby effectively preventing the influence of the mis-alignment between the point clouds on the network training.
In each training period, the method and the device sequentially utilize the generated data to perform supervised training and utilize the real data to perform self-supervised training.
Drawings
Fig. 1 is a flowchart of a semi-supervised pose estimation method of the present invention.
Fig. 2 is a partial pose estimation result diagram.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
This embodiment provides a semi-supervised object pose estimation method based on generated data and unlabeled real data; a schematic diagram of the framework is shown in Fig. 1, and the method comprises the following steps:
S1, generating object point cloud data with pose labels using the CAD model of the object;
S2, acquiring a color image and a depth image of the target object without labels, inputting the color image into a trained instance segmentation network to obtain an instance segmentation result, and acquiring a point cloud of the target object from the depth image according to the segmentation result;
S3, in each training period, performing supervised training on the pose estimation network with the labeled generated data, and self-supervised training on the network with the unlabeled real data;
S4, after each training period is finished, calculating the accuracy of the pose estimation model with part of the real data, and selecting the final network model according to the accuracy.
In step S1, OpenGL is used to render the object CAD model so as to obtain labeled object point cloud data at different poses.
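The sketch below is a deliberately simplified stand-in for this step: it skips the OpenGL rendering (and therefore self-occlusion) and directly transforms CAD-model points with random poses to produce labeled pairs; the pose ranges and noise level are assumptions.

```python
import numpy as np

def generate_labeled_point_clouds(model_points, num_samples, rng=None):
    """Produce (point cloud, pose label) pairs by transforming CAD-model
    points with random poses; rendering and occlusion are intentionally
    omitted in this simplification.

    model_points: (M, 3) array of points sampled from the object CAD model
    """
    rng = rng or np.random.default_rng()
    samples = []
    for _ in range(num_samples):
        q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
        R = q * np.sign(np.linalg.det(q))            # proper random rotation
        t = rng.uniform(-0.1, 0.1, size=3)           # assumed translation range
        cloud = model_points @ R.T + t
        cloud = cloud + rng.normal(scale=0.001, size=cloud.shape)  # mild noise (assumed)
        samples.append((cloud, (R, t)))              # point cloud with its pose label
    return samples
```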
In step S2, the instance segmentation network is first trained with real color images annotated with 2D mask labels; the trained segmentation network is then used to segment real color images, and according to the segmentation result the object region of the depth map is converted into an object point cloud using the camera intrinsics, which serves as the input of the subsequent pose estimation network.
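A short sketch of the depth-to-point-cloud conversion under a pinhole camera model is given below; the intrinsic parameters and depth scale are placeholders for whatever the actual camera provides.

```python
import numpy as np

def depth_mask_to_point_cloud(depth, mask, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project the masked region of a depth image into an object point
    cloud using the pinhole camera model.

    depth: (H, W) raw depth image; mask: (H, W) boolean instance mask
    fx, fy, cx, cy: camera intrinsics; depth_scale converts raw depth units
    to metres (1000.0 assumes millimetre depth).
    """
    v, u = np.nonzero(mask)                       # pixel coordinates inside the mask
    z = depth[v, u].astype(np.float64) / depth_scale
    valid = z > 0                                 # drop pixels with missing depth
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)            # (N, 3) object point cloud
```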
In step S3, in each training period the pose estimation network is first trained in a supervised manner with the generated data and then in a self-supervised manner with the unlabeled real data, as shown in Fig. 1. In supervised training with the labeled generated data, the object model point cloud is transformed with the pose predicted by the network and with the real pose, respectively, and the average distance between the two transformed model point clouds is calculated and used as the loss function of the supervised training. In self-supervised training with the real data, the object model point cloud is transformed with the pose predicted by the network, and the average distance between the transformed model point cloud and the actual input point cloud is calculated to form the loss function of the self-supervised training. The average distance between the transformed model point cloud and the actual input point cloud is calculated as follows:
for each point in the actual input point cloud, the spatially closest point in the transformed model point cloud is found, forming a closest-point set; the average distance between the actual point cloud and the corresponding points in the closest-point set is then calculated as the average distance between the transformed model point cloud and the actual input point cloud. The calculation formula is as follows:
$$L = \frac{1}{N} \sum_{i=1}^{N} \min_{j \in \{1,\dots,M\}} \left\| P_i - \left(\hat{R} Q_j + \hat{t}\right) \right\|$$

where $P_i$ is the i-th point in the input object point cloud, $Q_j$ is the j-th point in the object model point cloud, $\hat{R}$ and $\hat{t}$ are the rotation and translation components predicted by the network, N is the number of points in the input point cloud, and M is the number of points in the model point cloud.
In addition, during self-supervised training, the input point cloud is first subjected to a random homogeneous transformation to obtain a new point cloud; the two point clouds are then input into the network separately, two self-supervised losses are calculated from their respective predicted poses, and the network is trained with both. The complete self-supervised loss function is:

$$L_{\mathrm{self}} = L_1 + L_2$$

where the subscripts 1 and 2 denote the point clouds before and after the random homogeneous transformation, respectively.
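Putting the pieces together, one training period might look like the following sketch, which reuses the loss functions sketched earlier; `pose_net`, the data loaders and the optimizer are assumed objects, and batching details are omitted.

```python
def train_one_period(pose_net, optimizer, synthetic_loader, real_loader, model_points):
    """One training period: a supervised pass over the generated data followed
    by a self-supervised pass over the unlabeled real data."""
    pose_net.train()
    # 1) supervised training with pose-labeled generated data
    for points, R_gt, t_gt in synthetic_loader:
        R_pred, t_pred = pose_net(points)
        loss = supervised_add_loss(model_points, R_pred, t_pred, R_gt, t_gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # 2) self-supervised training with unlabeled real point clouds
    for real_points in real_loader:
        loss = dual_self_supervised_loss(pose_net, real_points, model_points)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```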
in the implementation process of step S4, after each training period is finished, the average distance between the actual point cloud and the converted model point cloud is calculated on a test set, and the calculated average distance is used as an evaluation index of model accuracy. The smaller the average distance, the more accurate the model is considered. And in the first training period, taking the calculated average distance as the optimal distance. In the subsequent training period, if the average distance is smaller than the optimal distance, the optimal distance value is updated; if the average distance is larger than the optimal average distance and the difference value is larger than the set threshold value, the model trained in the period is abandoned, and the next period continues to train on the basis of the model in the previous period.
In summary, compared with pose estimation methods in the prior art, the main innovations of the invention are the following three points:
according to the invention, the pose estimation network can be trained only by generating data and unmarked real data, so that the problem that the 6D pose tag is difficult to acquire in the existing method is solved, and the flexibility of pose estimation application is greatly improved.
The invention provides a point-cloud-based self-supervised training method for object pose estimation, which supervises the network by aligning the transformed model point cloud with the actual object point cloud; it further provides a point cloud transformation strategy in which two self-supervised losses are calculated simultaneously to train the network, effectively preventing incorrect alignment between point clouds from affecting network training.
The invention provides a semi-supervised training method in which, within each training period, supervised training with the generated data and self-supervised training with the real data are carried out in sequence. Part of the pose estimation results are shown in Fig. 2.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (4)

1. A semi-supervised object pose estimation method combining generated data and label-free data, characterized by comprising the following steps:
1) generating point cloud data with pose labels from a CAD model of a target object, i.e. the generated data;
2) acquiring a color image and a depth image of the target object without labels, inputting the color image into a trained instance segmentation network to obtain an instance segmentation result, and obtaining the point cloud of the target object, i.e. the label-free real data, from the depth image according to the segmentation result;
3) in each training period, performing supervised training on the pose estimation network model with the generated data and self-supervised training on the pose estimation network model with the unlabeled real data, wherein in each training period the supervised training is performed first and the self-supervised training afterwards; during the self-supervised training of the pose estimation network model with the unlabeled real data, the model point cloud is transformed according to the pose predicted by the pose estimation network model, and the average distance between the transformed model point cloud and the actually input real data is calculated to form the loss function of the self-supervised training, the average distance between the transformed model point cloud and the actually input real data being calculated as follows:
for each point in the actually input real point cloud, the spatially closest point in the transformed model point cloud is found, forming a closest-point set, and the average distance between the actual point cloud and the corresponding points in the closest-point set is then calculated, i.e. the average distance between the transformed model point cloud and the actually input real data;
the expression of the loss function L of the self-supervised training is:
Figure FDA0003592516590000011
wherein,
Figure FDA0003592516590000012
for the ith point in the actually input point cloud,
Figure FDA0003592516590000013
is the j-th point in the object model point cloud,
Figure FDA0003592516590000014
and
Figure FDA0003592516590000015
respectively predicting a rotary component and a translational component, namely the pose, which are obtained by the pose estimation network model, wherein N is the point number of the input point cloud, and M is the point number of the model point cloud;
during the self-supervised training of the pose estimation network model with the unlabeled real data, a random homogeneous transformation is applied to the input real data to obtain a new point cloud, the two point clouds before and after the random homogeneous transformation are input separately into the pose estimation network model, two self-supervised loss functions are calculated from the respective predicted poses, and the network is trained with both together, the complete self-supervised loss function in the self-supervised training being:

$$L_{\mathrm{self}} = L_1 + L_2$$

where the subscripts 1 and 2 correspond to the two point clouds before and after the random homogeneous transformation, respectively;
4) after each training period is finished, calculating the accuracy of the pose estimation network model with part of the real data, selecting the final pose estimation network model according to the accuracy, and estimating the object pose with the final pose estimation network model.
2. The semi-supervised object pose estimation method combining generated data and unlabeled data according to claim 1, wherein during the supervised training of the pose estimation network model with the generated data, the input point cloud data is transformed according to the pose predicted by the pose estimation network model and according to the real pose in the pose label, respectively, and the average distance between the two transformed point clouds is calculated as the loss function of the supervised training.
3. The semi-supervised object pose estimation method combining generated data and label-free data according to claim 1, wherein in step 4), after each training period is finished, the average distance between the actual point cloud and the transformed model point cloud is calculated on a test set and used as the evaluation index of the accuracy of the pose estimation network model, and the smaller the average distance, the more accurate the model.
4. The semi-supervised object pose estimation method combining generated data and unlabeled data according to claim 3, wherein after the first training period, the calculated average distance is taken as the optimal distance; in each subsequent training period, if the calculated average distance is smaller than the optimal distance, the optimal distance is updated, and if the average distance is larger than the optimal distance and the difference exceeds a set threshold, the model trained in that period is discarded and the next period continues training from the model of the previous period.
CN202110241227.5A 2021-03-04 2021-03-04 Semi-supervised object pose estimation method combining generated data and label-free data Active CN113129370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110241227.5A CN113129370B (en) 2021-03-04 2021-03-04 Semi-supervised object pose estimation method combining generated data and label-free data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110241227.5A CN113129370B (en) 2021-03-04 2021-03-04 Semi-supervised object pose estimation method combining generated data and label-free data

Publications (2)

Publication Number Publication Date
CN113129370A CN113129370A (en) 2021-07-16
CN113129370B true CN113129370B (en) 2022-08-19

Family

ID=76772511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110241227.5A Active CN113129370B (en) 2021-03-04 2021-03-04 Semi-supervised object pose estimation method combining generated data and label-free data

Country Status (1)

Country Link
CN (1) CN113129370B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023206268A1 (en) * 2022-04-28 2023-11-02 西门子股份公司 Method and apparatus for generating training data set, and electronic device and readable medium
CN115953410B (en) * 2023-03-15 2023-05-12 安格利(成都)仪器设备有限公司 Corrosion pit automatic detection method based on target detection supervised learning

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845515A (en) * 2016-12-06 2017-06-13 上海交通大学 Robot target identification and pose reconstructing method based on virtual sample deep learning
CN108491880A (en) * 2018-03-23 2018-09-04 西安电子科技大学 Object classification based on neural network and position and orientation estimation method
CN109816725A (en) * 2019-01-17 2019-05-28 哈工大机器人(合肥)国际创新研究院 A kind of monocular camera object pose estimation method and device based on deep learning
EP3500897A1 (en) * 2016-08-29 2019-06-26 Siemens Aktiengesellschaft Method and system for anomaly detection in a manufacturing system
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network
CN110503680A (en) * 2019-08-29 2019-11-26 大连海事大学 Monocular scene depth estimation method based on unsupervised convolutional neural networks
CN110910451A (en) * 2019-10-23 2020-03-24 同济大学 Object pose estimation method and system based on deformed convolution network
CN111797692A (en) * 2020-06-05 2020-10-20 武汉大学 Depth image gesture estimation method based on semi-supervised learning
CN111931591A (en) * 2020-07-15 2020-11-13 北京百度网讯科技有限公司 Method and device for constructing key point learning model, electronic equipment and readable storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3500897A1 (en) * 2016-08-29 2019-06-26 Siemens Aktiengesellschaft Method and system for anomaly detection in a manufacturing system
CN106845515A (en) * 2016-12-06 2017-06-13 上海交通大学 Robot target identification and pose reconstructing method based on virtual sample deep learning
CN108491880A (en) * 2018-03-23 2018-09-04 西安电子科技大学 Object classification based on neural network and position and orientation estimation method
CN109816725A (en) * 2019-01-17 2019-05-28 哈工大机器人(合肥)国际创新研究院 A kind of monocular camera object pose estimation method and device based on deep learning
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network
CN110503680A (en) * 2019-08-29 2019-11-26 大连海事大学 Monocular scene depth estimation method based on unsupervised convolutional neural networks
CN110910451A (en) * 2019-10-23 2020-03-24 同济大学 Object pose estimation method and system based on deformed convolution network
CN111797692A (en) * 2020-06-05 2020-10-20 武汉大学 Depth image gesture estimation method based on semi-supervised learning
CN111931591A (en) * 2020-07-15 2020-11-13 北京百度网讯科技有限公司 Method and device for constructing key point learning model, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113129370A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN111179324B (en) Object six-degree-of-freedom pose estimation method based on color and depth information fusion
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
CN108648274B (en) Cognitive point cloud map creating system of visual SLAM
CN110243370A (en) A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning
CN111145253B (en) Efficient object 6D attitude estimation algorithm
CN113129370B (en) Semi-supervised object pose estimation method combining generated data and label-free data
CN113012122B (en) Category-level 6D pose and size estimation method and device
CN110188835A (en) Data based on production confrontation network model enhance pedestrian's recognition methods again
CN109815847A (en) A kind of vision SLAM method based on semantic constraint
CN111797692B (en) Depth image gesture estimation method based on semi-supervised learning
CN113297988B (en) Object attitude estimation method based on domain migration and depth completion
CN114613013A (en) End-to-end human behavior recognition method and model based on skeleton nodes
CN113221647A (en) 6D pose estimation method fusing point cloud local features
CN113408584A (en) RGB-D multi-modal feature fusion 3D target detection method
Du et al. Stereo-matching network for structured light
CN115661246A (en) Attitude estimation method based on self-supervision learning
Tao et al. Indoor 3D semantic robot VSLAM based on mask regional convolutional neural network
CN112465836B (en) Thermal infrared semantic segmentation unsupervised field self-adaption method based on contour information
CN112308893B (en) Monocular depth estimation method based on iterative search strategy
Qiu et al. HGG-CNN: the generation of the optimal robotic grasp pose based on vision
Wada et al. Dataset Generation for Semantic Segmentation from 3D Scanned Data Considering Domain Gap
CN114266900B (en) Monocular 3D target detection method based on dynamic convolution
CN117523547B (en) Three-dimensional scene semantic perception method, system, equipment and medium
CN116740820B (en) Single-view point cloud three-dimensional human body posture and shape estimation method based on automatic augmentation
Yang et al. Visual Semantic SLAM Based on Examination of Moving Consistency in Dynamic Scenes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant