CN115131437A - Pose estimation method, and training method, device, equipment and medium of relevant model - Google Patents


Info

Publication number
CN115131437A
Authority
CN
China
Prior art keywords: pose, sample, target, image, color image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210823003.XA
Other languages
Chinese (zh)
Inventor
周晓巍 (Zhou Xiaowei)
林浩通 (Lin Haotong)
彭思达 (Peng Sida)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Shangtang Technology Development Co Ltd
Original Assignee
Zhejiang Shangtang Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Shangtang Technology Development Co Ltd filed Critical Zhejiang Shangtang Technology Development Co Ltd
Priority to CN202210823003.XA priority Critical patent/CN115131437A/en
Publication of CN115131437A publication Critical patent/CN115131437A/en
Priority to PCT/CN2023/105934 priority patent/WO2024012333A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a pose estimation method, and a training method, device, equipment and medium of a relevant model. The training method of the pose estimation model comprises the following steps: obtaining a sample image containing an object to be positioned, wherein the sample image contains a sample color image and a sample depth image corresponding to the sample color image; processing the sample color image by using a pose estimation model to obtain a sample initial pose of the object to be positioned; optimizing the sample initial pose based on depth information of the object to be positioned in the sample depth image to obtain an optimized pose of the object to be positioned; and adjusting network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose. This scheme can improve the efficiency of model training.

Description

Pose estimation method, and training method, device, equipment and medium of relevant model
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a pose estimation method and a training method, a device, equipment and a medium of a relevant model.
Background
With the development of science and technology, a camera can be used to capture an image of an object to be positioned, and the captured image can then be processed by a network model to obtain the pose of the object. At present, training such a network model requires a large number of labeled sample images; labeling the sample images involves a huge workload and is time-consuming, so the whole training process takes a long time and training efficiency is low.
Disclosure of Invention
The application at least provides a pose estimation method and a training method, a device, equipment and a medium of a relevant model.
The application provides a method for training a pose estimation model, which comprises the following steps: obtaining a sample image containing an object to be positioned, wherein the sample image contains a sample color image and a sample depth image corresponding to the sample color image; processing the sample color image by using a pose estimation model to obtain a sample initial pose of an object to be positioned; optimizing the initial pose of the sample based on the depth information of the object to be positioned in the sample depth image to obtain the optimized pose of the object to be positioned; and adjusting network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
Therefore, after the sample color image is processed by using the pose estimation model to obtain the sample initial pose of the object to be positioned, the sample initial pose is optimized by using the sample depth image, so that the optimized pose of the object to be positioned is more accurate. Further, the network parameters in the pose estimation model are adjusted by using the difference between the optimized pose and the sample initial pose, so the sample color image does not need to be labeled, which reduces the labeling workload and improves the training efficiency of the pose estimation model.
Wherein optimizing the sample initial pose based on the depth information of the object to be positioned in the sample depth image to obtain the optimized pose of the object to be positioned comprises the following steps: determining a rendering depth map of the object to be positioned based on the sample initial pose and a preset three-dimensional model corresponding to the object to be positioned; determining an optimization term by using the difference between the rendering depth map and the sample depth image; and adjusting the sample initial pose so that the optimization term meets a preset requirement, and taking the adjusted sample initial pose as the optimized pose.
Therefore, a rendering depth map of the object to be positioned is determined based on the sample initial pose and a preset three-dimensional model corresponding to the object to be positioned, then an optimization term is constructed based on the difference between the rendering depth map and the sample depth image, and the sample initial pose is adjusted by the optimization term, so that the adjusted sample initial pose is more accurate.
Wherein the preset requirement is that the optimization term is minimized; and/or, the method further comprises: determining a normal map of the object to be positioned based on the sample initial pose and the preset three-dimensional model; and determining the optimization term by using the difference between the rendering depth map and the sample depth image comprises: respectively back-projecting the rendering depth map and the sample depth image to obtain a first point cloud corresponding to the rendering depth map and a second point cloud corresponding to the sample depth image, wherein the first point cloud comprises first three-dimensional points corresponding to a plurality of object pixel points, the second point cloud comprises second three-dimensional points corresponding to each object pixel point, and the object pixel points are pixel points belonging to the object to be positioned in the sample color image; for each object pixel point, determining a deviation characterization value corresponding to the object pixel point, wherein the deviation characterization value is the product of a target pose difference corresponding to the object pixel point and a normal direction corresponding to the object pixel point in the normal map, and the target pose difference is the pose difference between the first three-dimensional point corresponding to the object pixel point and the second three-dimensional point corresponding to the object pixel point; and determining the optimization term by combining the deviation characterization values corresponding to the object pixel points.
Therefore, the first point cloud and the second point cloud are obtained by back-projecting the rendering depth map and the sample depth image, and the difference between the three-dimensional points in the first point cloud and the three-dimensional points in the second point cloud is combined with the normal direction of each point, so that the determined deviation characterization value is more accurate.
Before adjusting the network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose, the method further comprises: judging whether the optimized pose is a preset error estimation pose; and in response to the optimized pose not being the preset error estimation pose, performing the step of adjusting the network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
Therefore, under the condition that the optimized pose is not the preset error estimation pose, the difference between the optimized pose and the sample initial pose is used, network parameters in the pose estimation model are adjusted more reasonably, and disturbance of error estimation on the pose estimation model can be reduced.
Wherein judging whether the optimized pose is a preset error estimation pose comprises: acquiring a central tendency characterization value among the deviation characterization values corresponding to the object pixel points, wherein the object pixel points are pixel points belonging to the object to be positioned in the sample color image, the deviation characterization value corresponding to an object pixel point is the product of the target pose difference corresponding to the object pixel point and the normal direction corresponding to the object pixel point, the target pose difference is the pose difference between the first three-dimensional point corresponding to the object pixel point and the corresponding second three-dimensional point, the first three-dimensional point is a three-dimensional point in the first point cloud corresponding to the rendering depth map, and the second three-dimensional point is a three-dimensional point in the second point cloud corresponding to the sample depth image; judging whether the central tendency characterization value is smaller than or equal to a preset size, wherein the preset size is related to the size of the object to be positioned in the physical world; and determining that the optimized pose is not a preset error estimation pose in response to the central tendency characterization value being smaller than or equal to the preset size.
Therefore, by regarding the optimized pose as not being a preset error estimation pose when the central tendency characterization value is not larger than the preset size, the optimized pose can be filtered based on the physical size of the object to be positioned.
Wherein processing the sample color image by using the pose estimation model to obtain the sample initial pose of the object to be positioned comprises the following steps: determining projection positions of a plurality of three-dimensional key points of the object to be positioned on the sample color image by using the pose estimation model; and determining the sample initial pose of the object to be positioned based on the projection position of each three-dimensional key point on the sample color image and the internal parameters of the target camera.
Therefore, the projection position of the three-dimensional key point of the object to be positioned on the sample color image can be determined through the pose estimation model, and the sample initial pose of the object to be positioned is obtained according to the determined projection position of the three-dimensional key point and the internal parameters of the target camera.
Wherein determining the projection positions of the plurality of three-dimensional key points of the object to be positioned on the sample color image by using the pose estimation model comprises the following steps: predicting a direction vector from each object pixel point to each projection position by using the pose estimation model, wherein the object pixel points are pixel points belonging to the object to be positioned in the sample color image; for each projection position, determining a preset number of direction vectors from the plurality of direction vectors corresponding to the projection position, and generating candidate projection positions corresponding to the direction vectors; determining the score of each candidate projection position based on the positional relationship among the candidate projection positions; and taking the candidate projection position whose score meets the preset requirement as the projection position.
Therefore, a plurality of candidate projection positions are determined based on the direction vector of each object pixel point relative to the projection position, and then the candidate projection position meeting the requirement is selected as the final projection position, so that the determined projection position is more accurate.
Wherein determining a preset number of direction vectors from the plurality of direction vectors corresponding to the projection position and generating the candidate projection position corresponding to each direction vector comprises: summing the position of each object pixel point and the direction vector corresponding to the object pixel point to obtain the candidate projection position corresponding to each object pixel point. Determining the score of each candidate projection position based on the positional relationship among the candidate projection positions comprises: for each candidate projection position, determining the number of target distances between the candidate projection position and the other candidate projection positions, and taking the number of target distances as the score, wherein a target distance is a distance smaller than or equal to a preset distance. Taking the candidate projection position whose score meets the preset requirement as the projection position comprises: taking the candidate projection position corresponding to the maximum score as the projection position.
Therefore, the final projection position is determined and obtained by determining the distance between the candidate projection positions, so that the determined projection position is more accurate.
Wherein processing the sample color image by using the pose estimation model to obtain the sample initial pose of the object to be positioned comprises the following steps: performing target detection on the sample color image by using the pose estimation model to obtain the position of the object to be positioned; cropping the sample color image based on the position of the object to be positioned to obtain a local image containing the object to be positioned; and processing the local image to obtain the sample initial pose of the object to be positioned.
Therefore, target detection is first performed on the sample color image to obtain the position of the object to be positioned, the sample color image is then cropped to obtain a local image containing the object to be positioned, and the local image is processed to obtain the sample initial pose of the object to be positioned, so that background interference can be reduced and the accuracy of the sample initial pose can be improved.
The application provides a pose estimation method, which comprises the following steps: acquiring a target image containing an object to be positioned, wherein the target image comprises a target color image and a target depth image corresponding to the target color image; processing the target color image by using a pose estimation model to obtain a target initial pose of the object to be positioned; and optimizing the target initial pose based on depth information of the object to be positioned in the target depth image to obtain the target pose of the object to be positioned; wherein the pose estimation model is obtained by training with the above training method of the pose estimation model.
Therefore, after the target color image is processed by using the pose estimation model to obtain the target initial pose of the object to be positioned, the target initial pose is optimized by using the target depth image, so that the optimized target pose of the object to be positioned is more accurate.
The application provides a training apparatus for a pose estimation model, which includes: a sample image acquisition module, configured to acquire a sample image containing an object to be positioned, wherein the sample image contains a sample color image and a sample depth image corresponding to the sample color image; a sample pose estimation module, configured to process the sample color image by using the pose estimation model to obtain a sample initial pose of the object to be positioned; a sample pose optimization module, configured to optimize the sample initial pose based on depth information of the object to be positioned in the sample depth image to obtain an optimized pose of the object to be positioned; and a parameter adjustment module, configured to adjust network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
The application provides a pose estimation apparatus, which includes: a target image acquisition module, configured to acquire a target image containing an object to be positioned, wherein the target image comprises a target color image and a target depth image corresponding to the target color image; a target pose estimation module, configured to process the target color image by using the pose estimation model to obtain a target initial pose of the object to be positioned; and a target pose optimization module, configured to optimize the target initial pose based on depth information of the object to be positioned in the target depth image to obtain the target pose of the object to be positioned; wherein the pose estimation model is obtained by training with the above training apparatus for the pose estimation model.
The application provides an electronic device, which comprises a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the above training method of the pose estimation model or the above pose estimation method.
The present application provides a computer-readable storage medium having stored thereon program instructions that, when executed by a processor, implement the above training method of the pose estimation model or the above pose estimation method.
According to the scheme, after the sample color image is processed by using the pose estimation model to obtain the sample initial pose of the object to be positioned, the sample initial pose is optimized by using the sample depth image, so that the optimized pose of the object to be positioned is more accurate. Further, the network parameters in the pose estimation model are adjusted by using the difference between the optimized pose and the sample initial pose, so the sample color image does not need to be labeled, which reduces the labeling workload and improves the training efficiency of the pose estimation model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic flowchart of an embodiment of a method for training a pose estimation model according to the present application;
FIG. 2 is a sub-flowchart of step S13 in an embodiment of the pose estimation model training method of the present application;
FIG. 3 is another schematic flow chart diagram illustrating an embodiment of a training method for a pose estimation model of the present application;
FIG. 4 is a schematic flowchart of an embodiment of a pose estimation method of the present application;
FIG. 5 is a schematic structural diagram of an embodiment of a training apparatus for a pose estimation model of the present application;
FIG. 6 is a schematic structural diagram of an embodiment of the pose estimation apparatus of the present application;
FIG. 7 is a schematic structural diagram of an embodiment of an electronic device of the present application;
FIG. 8 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
An execution subject of the training method of the pose estimation model provided by the embodiments of the present disclosure may be a training apparatus of the pose estimation model. The training apparatus may be any terminal device, server, or other processing device capable of executing the method of the embodiments of the present disclosure, where the terminal device may be a visual positioning device, user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the training method of the pose estimation model may be implemented by a processor calling computer-readable instructions stored in a memory.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a method for training a pose estimation model according to the present application. Specifically, the method may include the steps of:
step S11: and acquiring a sample image containing the object to be positioned, wherein the sample image contains a sample color image and a sample depth image corresponding to the sample color image.
The sample image may be a real image or a composite image. In some application scenarios, the sample images may include some real images and some composite images. In the case that the sample image is a real image, the sample image containing the object to be positioned may be obtained by the execution device that executes the training method of the pose estimation model provided by the embodiments of the present disclosure shooting the object to be positioned, or may be shot by another device and then transmitted to the execution device over a communication connection. In some disclosed embodiments, a public image dataset for pose estimation may be used as the sample images.
The pixel value of each pixel point in the sample depth image is used to represent the depth value of the corresponding pixel point in the sample color image. The depth may specifically be the distance between the three-dimensional point corresponding to the pixel point and the shooting device.
Step S12: and processing the sample color image by using the pose estimation model to obtain the sample initial pose of the object to be positioned.
The pose estimation model may be a pre-trained model or a model that has not been pre-trained. The pose estimation model may be a ResNet network or another network of any structure. The pose estimation model may directly output the sample initial pose of the object to be positioned, or an intermediate result output by the pose estimation model may be further processed to obtain the sample initial pose of the object to be positioned.
The sample initial pose may be a six-degree-of-freedom pose; that is, the sample initial pose contains the position and orientation of the object to be positioned in the camera coordinate system.
Step S13: and optimizing the initial pose of the sample based on the depth information of the object to be positioned in the sample depth image to obtain the optimized pose of the object to be positioned.
As described above, the pixel value of each pixel point in the sample depth image is used to represent the depth value of the corresponding pixel point in the sample color image, and the depth may specifically be the distance between the three-dimensional point corresponding to the pixel point and the shooting device. Because a single sample color image cannot well express the distance between the object to be positioned and the shooting device, the obtained sample initial pose may be inaccurate; therefore, the sample initial pose is optimized in combination with the depth information of the object to be positioned in the sample depth image, so that the resulting optimized pose is more accurate.
Step S14: and adjusting network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
Alternatively, a loss may be determined based on the difference between the optimized pose and the sample initial pose, and the network parameters in the pose estimation model may then be adjusted using the loss.
According to the scheme, after the sample color image is processed by using the pose estimation model to obtain the sample initial pose of the object to be positioned, the sample initial pose is optimized by using the sample depth image, so that the optimized pose of the object to be positioned is more accurate. Further, the network parameters in the pose estimation model are adjusted by using the difference between the optimized pose and the sample initial pose, so the sample color image does not need to be labeled, which reduces the labeling workload and improves the training efficiency of the pose estimation model.
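To make the training flow concrete, the following is a minimal PyTorch-style sketch of steps S11 to S14. The helper names pose_model and optimize_with_depth are hypothetical stand-ins for the components described above, not names from this application, and the pose loss shown is only one simple choice of "difference".

```python
# Hypothetical sketch of the self-supervised loop in steps S11-S14;
# `optimize_with_depth` stands in for the depth-based refinement of S13.
import torch
import torch.nn.functional as F

def pose_loss(pred_pose, pseudo_pose):
    # pred_pose, pseudo_pose: (B, 3, 4) [R|t] matrices; a simple choice of
    # pose discrepancy -- the application only requires "the difference".
    return F.mse_loss(pred_pose, pseudo_pose)

def train_step(pose_model, optimizer, color_img, depth_img, mesh, K):
    # S12: predict the sample initial pose from the color image alone.
    init_pose = pose_model(color_img)

    # S13: refine the initial pose against the measured depth; the refined
    # pose serves as a pseudo-label, so no gradient flows through it.
    with torch.no_grad():
        opt_pose = optimize_with_depth(init_pose, depth_img, mesh, K)

    # S14: the supervision signal is the gap between the two poses.
    loss = pose_loss(init_pose, opt_pose)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the optimized pose is computed without gradients, it acts purely as a pseudo-label, which is what removes the need for manual annotation.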
In some disclosed embodiments, the step S12 may include the steps of:
and carrying out target detection on the sample color image by using the pose estimation model to obtain the position of the object to be positioned. Illustratively, the pose estimation model includes a target detection sub-network, and the target detection sub-network is configured to perform target detection on the sample color image to obtain a position of the object to be positioned in the sample color image. In other disclosed embodiments, the target detection subnetwork and the pose estimation model may also be independent from each other, that is, the target detection network is first used to perform target detection on the sample color image to obtain the position of the object to be positioned, and then the pose estimation model processes the sample color image based on the detection result of the target detection network.
Then, based on the position of the object to be positioned, the sample color image is cropped to obtain a local image containing the object to be positioned. For example, the region where the object to be positioned is located on the sample color image may be expanded outward by a preset scale and then cropped, and the cropped part containing the object to be positioned is used as the local image.
Then, the local image is processed to obtain the sample initial pose of the object to be positioned. By detecting the position of the object to be positioned and cropping the sample color image, the interference of the image background on the processing result can be reduced, and the accuracy of the sample initial pose is improved. Illustratively, the background and the foreground of a composite image are likely to differ greatly; if the composite image is processed directly, the determined sample initial pose is likely to be inaccurate.
Target detection is first performed on the sample color image to obtain the position of the object to be positioned, the sample color image is then cropped to obtain a local image containing the object to be positioned, and the local image is processed to obtain the sample initial pose of the object to be positioned, so that background interference can be reduced and the accuracy of the sample initial pose can be improved.
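As an illustration of the detect-and-crop step, here is a small numpy sketch; the (x0, y0, x1, y1) box format and the 10% outward margin are assumptions for the example, since the text only specifies expanding the region by a preset scale.

```python
import numpy as np

def crop_with_margin(color_img: np.ndarray, box, ratio: float = 0.1):
    # box: detected region of the object to be positioned, as (x0, y0, x1, y1)
    x0, y0, x1, y1 = box
    mx = int((x1 - x0) * ratio)  # outward margin along x (the "preset scale")
    my = int((y1 - y0) * ratio)  # outward margin along y
    h, w = color_img.shape[:2]
    x0, y0 = max(0, x0 - mx), max(0, y0 - my)
    x1, y1 = min(w, x1 + mx), min(h, y1 + my)
    return color_img[y0:y1, x0:x1]  # local image containing the object
```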
In some disclosed embodiments, the step S12 may further include the following steps:
and determining the projection positions of a plurality of three-dimensional key points of the object to be positioned on the sample color image by using the pose estimation model. As described above, the sample color image may be cut to obtain a local image including the object to be positioned, and the determining of the projection positions of the three-dimensional key points on the sample color image with respect to the object to be positioned may specifically be determining the projection positions of the three-dimensional key points on the local image. The three-dimensional key points of the object to be positioned can be extracted from a preset three-dimensional model corresponding to the object to be positioned. Illustratively, the three-dimensional key points can be obtained from a three-dimensional point set obtained from a preset three-dimensional model through a farthest point sampling algorithm.
The sample initial pose of the object to be positioned is determined based on the projection position of each three-dimensional key point on the sample color image and the internal parameters of the target camera. After the projection position of each three-dimensional key point on the local image is determined, the sample initial pose of the object to be positioned can be determined in a PnP manner by combining the internal parameters of the target camera. Illustratively, the internal parameters of the target camera may include parameters such as the focal length. The specific way of obtaining the sample initial pose of the object to be positioned by using PnP is not described in detail here. The target camera may be the camera that captures the sample image, i.e., the shooting device described above.
The projection position of the three-dimensional key point of the object to be positioned on the sample color image can be determined through the pose estimation model, so that the sample initial pose of the object to be positioned is obtained according to the determined projection position of the three-dimensional key point and the internal parameters of the target camera.
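For concreteness, a hedged sketch of the PnP step with OpenCV: given the model-frame key points, their predicted 2D projections, and the intrinsic matrix K, recover the object-to-camera pose. EPnP is one solver choice among several; the function calls below are standard OpenCV, but the surrounding glue is assumed.

```python
import cv2
import numpy as np

def pose_from_keypoints(kpts_3d, kpts_2d, K):
    # kpts_3d: (M, 3) key points on the preset three-dimensional model
    # kpts_2d: (M, 2) their projection positions on the (cropped) image
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(kpts_3d, dtype=np.float64),
        np.asarray(kpts_2d, dtype=np.float64),
        K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return ok, R, tvec          # six-degree-of-freedom sample initial pose
```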
The above-mentioned manner of determining the projection positions of the three-dimensional key points on the sample color image with respect to the object to be positioned by using the pose estimation model may be:
and predicting the direction vector of each object pixel point to each projection position by using the pose estimation model. And the object pixel points are pixel points belonging to an object to be positioned in the sample color image. As described above, the pose estimation model can perform target detection on the sample color image to obtain the position of the object to be positioned. The semantic labels of the pixel points belonging to the object to be positioned can be set to be a first preset value (for example, the first preset value can be 1), the semantic labels of the remaining pixel points not belonging to the object to be positioned are set to be a second preset value (for example, the second preset value can be 0), and the pixel points with the semantic labels being the first preset value are the object pixel points. The direction vector may be a two-dimensional vector, and one dimension may be a vector in the x-axis direction of the sample color image, and one dimension may be a vector in the y-axis direction of the sample color image.
For example, the direction vector from an object pixel point to a projection position may refer to formula (1):

v_k(p) = x_k - p    (1)

where v_k(p) denotes the direction vector from the object pixel point p to the k-th projection position, x_k denotes the k-th projection position, and p denotes the position of the pixel point p.
For each projection position, a preset number of direction vectors are determined from the plurality of direction vectors corresponding to the projection position, and the candidate projection position corresponding to each direction vector is generated. Illustratively, the position of each object pixel point is summed with the direction vector corresponding to the object pixel point to obtain the candidate projection position corresponding to each object pixel point. For example, if there are 10 object pixel points belonging to the object to be positioned in the sample color image and the preset number is 5, then 5 direction vectors can be selected from the 10 direction vectors; each direction vector has a corresponding object pixel point, and each object pixel point is added to its corresponding direction vector to obtain the candidate projection position corresponding to that object pixel point, that is, 5 candidate projection positions are obtained. In some application scenarios, a plurality of sample color images are input into the pose estimation model, and the output result is a semantic label and a direction vector corresponding to each pixel point, where the semantic label is used to indicate whether the pixel point belongs to the object to be positioned. In some application scenarios, the input sample color images may contain multiple objects to be positioned, and the output semantic label corresponding to each pixel point may specifically be a label of which object to be positioned the pixel point belongs to; illustratively, the objects to be positioned may include cups, tables, stools, and the like. That is, the pose estimation model obtained by the training method provided by the embodiments of the present disclosure can perform pose estimation on multiple objects to be positioned at the same time, so as to obtain the target pose of each object to be positioned.
The manner of determining the corresponding candidate projection position based on a direction vector and its corresponding object pixel point may refer to formula (2):

h_{k,i} = p + v_k(p)    (2)

where {h_{k,i} | i = 1, 2, ..., N} are the candidate projection positions, N is the number of candidate projection positions, p denotes the object pixel point p, and v_k(p) denotes the direction vector from the object pixel point p to the k-th projection position.
Then, based on the positional relationship between the respective candidate projection positions, the score of each candidate projection position is determined. And taking the candidate projection position with the score meeting the preset requirement as the projection position.
Optionally, the manner of determining the score of each candidate projection position based on the positional relationship among the candidate projection positions may be: for each candidate projection position, determining the number of target distances between the candidate projection position and the other candidate projection positions, and taking the number of target distances as the score, where a target distance is a distance smaller than or equal to a preset distance. Exemplarily, the current candidate projection position is subtracted from each of the other candidate projection positions to obtain the distances between the current candidate projection position and the other candidate projection positions. The preset distance can be adjusted during the training of the pose estimation model to determine the final preset distance.
For example, the score of each candidate projection position may be calculated with reference to formula (3):

w_{k,i} = Σ_p I(|h_{k,i} - p - v_k(p)| ≤ θ)    (3)

where w_{k,i} denotes the score of the i-th candidate projection position, I is an indicator function that is 1 when the condition is met and 0 otherwise, and θ is the preset distance; for example, θ may be 1.
The manner of taking the candidate projection position whose score meets the preset requirement as the projection position may specifically be: taking the candidate projection position corresponding to the maximum score as the projection position. The projection positions corresponding to the other three-dimensional key points can be determined in the same way, and details are not repeated here.
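The voting procedure of formulas (2) and (3) can be sketched in a few lines of numpy. The hypothesis count of 5 and θ = 1 follow the examples in the text; sampling the hypotheses at random is an assumption.

```python
import numpy as np

def vote_keypoint(pix, vec, n_hyp=5, theta=1.0, rng=np.random.default_rng(0)):
    # pix: (N, 2) object pixel positions p; vec: (N, 2) predicted v_k(p)
    cand = pix + vec                       # formula (2): h = p + v_k(p)
    hyp = cand[rng.choice(len(cand), n_hyp, replace=False)]
    # formula (3): score = number of candidates within theta of the hypothesis
    d = np.linalg.norm(hyp[:, None, :] - cand[None, :, :], axis=-1)
    scores = (d <= theta).sum(axis=1)
    return hyp[scores.argmax()]            # maximum-score candidate position
```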
In some application scenarios, the training method of the pose estimation model provided by the embodiments of the present disclosure may further include a pre-training step for the pose estimation model. The pre-training may include the following specific steps: a plurality of sample images are obtained, which may be the same as or different from the sample image obtained in step S11, and each pixel point on these sample images is annotated with a sample semantic label and a sample projection position. A first loss between the semantic labels output by the pose estimation model and the sample semantic labels is obtained, a projection position is determined based on the direction vectors of the object pixel points output by the pose estimation model, a second loss between this projection position and the sample projection position is obtained, and the network parameters in the pose estimation model are adjusted by combining the first loss and the second loss. In the pre-training, the initial learning rate may be set to 1e-3 and halved after every first predetermined number of iterations. After pre-training, the learning rate may be adjusted to 5e-4 and halved after every second predetermined number of iterations. Optionally, the first predetermined number of iterations is twice the second predetermined number of iterations.
A plurality of candidate projection positions are determined based on the direction vectors of the object pixel points relative to the projection positions, and then the candidate projection positions meeting the requirements are selected from the candidate projection positions to serve as the final projection positions, so that the determined projection positions are more accurate.
Referring to fig. 2, fig. 2 is a schematic sub-flow diagram illustrating step S13 according to an embodiment of the pose estimation model training method of the present application. As shown in fig. 2, the step S13 may include the following steps:
step S131: and determining a rendering depth map of the object to be positioned based on the sample initial pose and a preset three-dimensional model corresponding to the object to be positioned.
The preset three-dimensional model can be created with drawing software, or constructed three-dimensionally by a modeling network from a plurality of images containing the object to be positioned.
The sample initial pose can be regarded as the pose, in the camera coordinate system, of the preset three-dimensional model corresponding to the object to be positioned. The manner of determining the rendering depth map of the object to be positioned based on the sample initial pose and the preset three-dimensional model corresponding to the object to be positioned may be: projecting the preset three-dimensional model onto the camera plane based on the sample initial pose to obtain the rendering depth map.
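As a simplified stand-in for the renderer, the sketch below projects sampled surface points of the model with the estimated pose and keeps the nearest depth per pixel; a real implementation would rasterize the mesh triangles, so this is only illustrative.

```python
import numpy as np

def render_depth(points, R, t, K, h, w):
    # points: (N, 3) model-frame surface points; (R, t): sample initial pose
    cam = points @ R.T + t                             # into the camera frame
    z = cam[:, 2]
    uv = cam @ K.T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(int)  # pinhole projection
    depth = np.full((h, w), np.inf)
    for (u, v), d in zip(uv, z):
        if 0 <= v < h and 0 <= u < w and d < depth[v, u]:
            depth[v, u] = d                # z-buffer: keep the closest point
    return np.where(np.isfinite(depth), depth, 0.0)
```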
Step S132: an optimization term is determined using a difference between the rendered depth map and the sample depth image.
The method for training the pose estimation model further comprises the following steps: and determining a normal map of the object to be positioned based on the sample initial pose and the preset three-dimensional model. Illustratively, the predetermined three-dimensional model may be composed of several planes (e.g., triangular mesh planes), each of which corresponds to a pixel value of a pixel point in the normal map, and the pixel value may be used to represent a normal direction of the plane.
The method for determining the optimization term by using the difference between the rendering depth map and the sample depth image may be:
The rendering depth map and the sample depth image are respectively back-projected to obtain a first point cloud corresponding to the rendering depth map and a second point cloud corresponding to the sample depth image. The first point cloud comprises first three-dimensional points corresponding to the object pixel points, and the second point cloud comprises second three-dimensional points corresponding to the object pixel points. As described above, the object pixel points are pixel points belonging to the object to be positioned in the sample color image. Specifically, the rendering depth map is back-projected by using the sample initial pose of the object to be positioned to obtain the first point cloud, and the sample depth image is back-projected by using the sample initial pose of the object to be positioned to obtain the second point cloud.
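A minimal back-projection sketch (the π^{-1} of formula (4) below), assuming a pinhole camera with intrinsics K and a binary mask marking the object pixel points:

```python
import numpy as np

def backproject(depth, mask, K):
    # Lift each object pixel (u, v) with depth d into a camera-frame 3D point.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.nonzero(mask)              # object pixels only
    d = depth[v, u]
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return np.stack([x, y, d], axis=1)   # (N, 3) point cloud
```

Applied to the rendering depth map this yields the first point cloud, and applied to the sample depth image it yields the second point cloud.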
Then, for each object pixel point, a deviation characterization value corresponding to the object pixel point is determined. The deviation characterization value may be a residual. The deviation characterization value corresponding to an object pixel point is the product of the target pose difference corresponding to the object pixel point and the normal direction corresponding to the object pixel point in the normal map, where the target pose difference is the pose difference between the first three-dimensional point corresponding to the object pixel point and the corresponding second three-dimensional point.
For an object pixel point p, the deviation characterization value L(p) may be determined with reference to formula (4):

L(p) = ||(π^{-1}(D_r(p)) - π^{-1}(D(p))) · N_r(p)||_2    (4)

where π^{-1} is the back-projection function, D_r(p) denotes the depth value of the object pixel point p in the rendering depth map, D(p) denotes the depth value of the object pixel point p in the sample depth image, and N_r(p) denotes the normal direction corresponding to the object pixel point p in the normal map. π^{-1}(D_r(p)) - π^{-1}(D(p)) represents the pose difference between the first three-dimensional point and the corresponding second three-dimensional point.
In other words, the deviation characterization value of an object pixel point can be taken as the distance from the second three-dimensional point to the plane defined by the first three-dimensional point and its normal, that is, a point-to-plane distance.
Then, the optimization term is determined by combining the deviation characterization values corresponding to the object pixel points. For example, the sum of the deviation characterization values, the mean value of the deviation characterization values, or the maximum value of the deviation characterization values may be used as the optimization term. These residual terms are minimized using a gradient descent method to obtain the desired optimized pose. In some application scenarios, to avoid the result converging to a local minimum, a set of hypothetical sample initial poses may be generated by perturbing the sample initial pose, and these poses are then optimized to obtain a more accurate optimized pose.
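Putting formula (4) together, a numpy sketch of the optimization term as the sum of the per-pixel deviation characterization values; choosing the sum rather than the mean or maximum is one of the options listed above.

```python
import numpy as np

def optimization_term(pts_render, pts_sample, normals):
    # pts_render: pi^{-1}(D_r(p)); pts_sample: pi^{-1}(D(p)); normals: N_r(p)
    resid = np.einsum('ij,ij->i', pts_render - pts_sample, normals)
    dev = np.abs(resid)   # per-pixel deviation characterization values L(p)
    return dev.sum()      # scalar for gradient descent to minimize
```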
Step S133: and adjusting the initial pose of the sample to enable the optimization item to meet the preset requirement, and taking the adjusted initial pose of the sample as the optimization pose.
Alternatively, the preset requirement may be an optimization term minimization.
In this way, the rendering depth map of the object to be positioned is determined based on the sample initial pose and the preset three-dimensional model corresponding to the object to be positioned, an optimization term is then constructed based on the difference between the rendering depth map and the sample depth image, and the sample initial pose is adjusted by using the optimization term, so that the adjusted sample initial pose is more accurate.
In addition, the first point cloud and the second point cloud are obtained by back-projecting the rendering depth map and the sample depth image, and the difference between the three-dimensional points in the first point cloud and the three-dimensional points in the second point cloud is combined with the normal direction of each point, so that the determined deviation characterization value is more accurate.
In some disclosed embodiments, before performing step S14, the following steps may be performed:
and judging whether the optimized pose is a preset error estimation pose. And in response to the optimized pose not being the preset erroneous estimation pose, performing a step of adjusting network parameters in the pose estimation model based on a difference between the optimized pose and the sample initial pose.
Optionally, in response to the optimized pose being the preset error estimation pose, the optimized pose is discarded and the step of adjusting the network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose is not performed.
By using the difference between the optimized pose and the initial pose of the sample under the condition that the optimized pose is not the preset error estimation pose, the network parameters in the pose estimation model are adjusted more reasonably, and the disturbance of error estimation on the pose estimation model can be reduced.
The method for judging whether the optimized pose is the preset error estimation pose may be as follows:
and acquiring a centralized trend characteristic value among the deviation characteristic values corresponding to the pixel points of each object. As described above, the object pixel point is a pixel point belonging to an object to be positioned in the sample color image. And the corresponding deviation characterization value of the object pixel point is the product of the corresponding target pose difference of the object pixel point and the corresponding normal direction of the object pixel point. The target pose difference is the pose difference between the first three-dimensional point and the second three-dimensional point corresponding to the object pixel point. The first three-dimensional point is a three-dimensional point in a first point cloud corresponding to the rendering depth map, and the second three-dimensional point is a three-dimensional point in a second point cloud corresponding to the sample depth image. The first point cloud is obtained by back-projecting the rendering depth image, and the second point cloud is obtained by back-projecting the sample depth image.
Then, whether the central tendency characterization value is smaller than or equal to a preset size is judged, where the preset size is related to the size of the object to be positioned in the physical world. For example, the preset size may be 0.2 times the length of the object to be positioned.
In response to the central tendency characterization value being smaller than or equal to the preset size, it is determined that the optimized pose is not a preset error estimation pose; in response to the central tendency characterization value being greater than the preset size, the optimized pose is determined to be a preset error estimation pose.
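A hedged sketch of this check: the median is one possible central tendency characterization value, and the 0.2x object-length threshold follows the example above.

```python
import numpy as np

def is_valid_pose(deviation_values, object_length):
    central = np.median(deviation_values)  # central tendency characterization
    return central <= 0.2 * object_length  # preset size tied to physical size
```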
By regarding the optimized pose as not being a preset error estimation pose when the central tendency characterization value is not greater than the preset size, the optimized pose can be filtered based on the physical size of the object to be positioned.
For a better understanding of the training method of the pose estimation model provided by the embodiments of the present disclosure, reference may be made to fig. 3, which is another schematic flow diagram of an embodiment of the training method of the pose estimation model of the present application. As shown in fig. 3, given a set of unannotated sample image data, the pose estimation model first predicts the initial poses of the object to be positioned on these sample color images. Pose optimization, which may be iterative, is then performed on these estimated sample initial poses by using the depth information. Next, each optimized pose is evaluated, that is, whether each optimized pose is a preset error estimation pose is judged, the optimized poses are filtered based on the judgment result, and failed estimations are discarded. Finally, the pose estimation model is adjusted by using the difference between the retained optimized poses and the sample initial poses. During pose estimation, the method can predict the 6D pose of the object to be positioned from a single sample color image through the pose estimation model.
The pose estimation model training method provided by the embodiment of the disclosure can be applied to the field of augmented reality application.
Referring to fig. 4, fig. 4 is a schematic flowchart of an embodiment of a pose estimation method according to the present application. As shown in fig. 4, a pose estimation method provided by the embodiment of the present disclosure may include the following steps:
step S21: and acquiring a target image containing an object to be positioned, wherein the target image comprises a target color image and a target depth image corresponding to the target color image.
The target image containing the object to be positioned may be obtained by shooting by an execution device of the pose estimation method, or by shooting by a device in communication connection with the execution device.
Step S22: and processing the target color image by using the pose estimation model to obtain the target initial pose of the object to be positioned.
The method for obtaining the initial pose of the target may refer to the method for obtaining the initial pose of the sample in the embodiment of the pose estimation model training method, and is not described here again. The pose estimation model is obtained by training with the training method provided by the embodiment of the pose estimation model training method.
Step S23: and optimizing the initial pose of the target based on the depth information of the object to be positioned in the target depth image to obtain the target pose of the object to be positioned.
The mode of obtaining the target pose of the object to be positioned may refer to the mode of obtaining the optimized pose in the embodiment of the training method for the pose estimation model, and is not described here again.
According to the scheme, after the target color image is processed by using the pose estimation model to obtain the target initial pose of the object to be positioned, the target initial pose is optimized by using the target depth image, so that the target pose of the optimized object to be positioned is more accurate.
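Reusing the hypothetical helpers from the training sketch above, inference (steps S21 to S23) reduces to predict-then-refine:

```python
# A sketch of steps S21-S23; `pose_model` and `optimize_with_depth` are the
# same assumed helpers as in the training sketch.
def estimate_pose(pose_model, color_img, depth_img, mesh, K):
    init_pose = pose_model(color_img)                                 # S22
    target_pose = optimize_with_depth(init_pose, depth_img, mesh, K)  # S23
    return target_pose
```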
The pose estimation method provided by the embodiment of the disclosure can be applied to the field of augmented reality application.
An execution subject of the pose estimation method provided by the embodiments of the present disclosure may be a pose estimation apparatus, which may be any terminal device, server, or other processing device capable of executing the method embodiments of the present disclosure, where the terminal device may be an augmented reality display device, a visual positioning device, user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the pose estimation method may be implemented by a processor calling computer-readable instructions stored in a memory.
It will be understood by those skilled in the art that, in the method of the present application, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible inherent logic.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of the training apparatus for the pose estimation model of the present application. The training apparatus 50 for the pose estimation model includes a sample image acquisition module 51, a sample pose estimation module 52, a sample pose optimization module 53, and a parameter adjustment module 54. The sample image acquisition module 51 is configured to acquire a sample image containing an object to be positioned, wherein the sample image contains a sample color image and a sample depth image corresponding to the sample color image; the sample pose estimation module 52 is configured to process the sample color image by using the pose estimation model to obtain a sample initial pose of the object to be positioned; the sample pose optimization module 53 is configured to optimize the sample initial pose based on depth information of the object to be positioned in the sample depth image to obtain an optimized pose of the object to be positioned; and the parameter adjustment module 54 is configured to adjust network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
According to the scheme, after the sample color image is processed by using the pose estimation model to obtain the sample initial pose of the object to be positioned, the sample initial pose is optimized by using the sample depth image, so that the optimized pose of the object to be positioned is more accurate. Moreover, by using the difference between the optimized pose and the sample initial pose to adjust the network parameters in the pose estimation model, the sample color image does not need to be labeled, which reduces the labeling workload and improves the training efficiency of the pose estimation model.
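As a hedged illustration only, a self-supervised training step of this kind might look as follows in PyTorch. `optimize_with_depth` and `pose_loss` are hypothetical stand-ins for the sample pose optimization module 53 and the difference measure used by the parameter adjustment module 54; the disclosure does not mandate these names or this framework.

```python
import torch

def train_step(model, optimizer, color_img, depth_img, obj_model, K):
    # Sample pose estimation module: predict the sample initial pose.
    initial_pose = model(color_img, K)
    # Sample pose optimization module: refine against the sample depth image.
    # The optimized pose acts as a pseudo-label, so no manual annotation of
    # the sample color image is required; gradients do not flow through it.
    with torch.no_grad():
        optimized_pose = optimize_with_depth(initial_pose, depth_img, obj_model, K)
    # Parameter adjustment module: the loss is the difference between the
    # optimized pose and the sample initial pose.
    loss = pose_loss(initial_pose, optimized_pose)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```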
In some disclosed embodiments, the sample pose optimization module 53 optimizing the sample initial pose based on the depth information of the object to be positioned in the sample depth image to obtain the optimized pose of the object to be positioned includes: determining a rendering depth map of the object to be positioned based on the sample initial pose and a preset three-dimensional model corresponding to the object to be positioned; determining an optimization term by using the difference between the rendering depth map and the sample depth image; and adjusting the sample initial pose so that the optimization term meets a preset requirement, and taking the adjusted sample initial pose as the optimized pose.
According to the scheme, the rendering depth map of the object to be positioned is determined based on the sample initial pose and the preset three-dimensional model corresponding to the object to be positioned, the optimization term is then constructed from the difference between the rendering depth map and the sample depth image, and the sample initial pose is adjusted by means of the optimization term, so that the adjusted sample initial pose is more accurate.
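A minimal sketch of this render-and-compare optimization follows, assuming a differentiable depth renderer `render_depth` (hypothetical) and a simple L1 optimization term over pixels where both depth maps are valid; the point-to-plane variant described next refines this term.

```python
import torch

def refine_pose(initial_pose, depth_image, obj_model, K, iters=50, lr=1e-2):
    # A 6-DoF tensor parameterization of the pose is assumed.
    pose = initial_pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(iters):
        rendered = render_depth(obj_model, pose, K)   # rendering depth map
        valid = (rendered > 0) & (depth_image > 0)    # pixels with valid depth
        # Optimization term: difference between the rendering depth map and
        # the sample depth image; the pose is adjusted to reduce it.
        term = (rendered[valid] - depth_image[valid]).abs().mean()
        opt.zero_grad()
        term.backward()
        opt.step()
    return pose.detach()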
In some disclosed embodiments, the preset requirement is that the optimization term is minimized; and/or, the sample pose optimization module 53 is further configured to: determine a normal map of the object to be positioned based on the sample initial pose and the preset three-dimensional model; and the sample pose optimization module 53 determining the optimization term by using the difference between the rendering depth map and the sample depth image includes: respectively back-projecting the rendering depth map and the sample depth image to obtain a first point cloud corresponding to the rendering depth map and a second point cloud corresponding to the sample depth image, where the first point cloud includes first three-dimensional points corresponding to a plurality of object pixel points, the second point cloud includes a second three-dimensional point corresponding to each object pixel point, and the object pixel points are pixel points belonging to the object to be positioned in the sample color image; for each object pixel point, determining a deviation characterization value corresponding to the object pixel point, where the deviation characterization value is the product of a target pose difference corresponding to the object pixel point and the normal direction corresponding to the object pixel point in the normal map, and the target pose difference is the pose difference between the first three-dimensional point corresponding to the object pixel point and the corresponding second three-dimensional point; and determining the optimization term by combining the deviation characterization values corresponding to the object pixel points.
According to the scheme, the first point cloud and the second point cloud are obtained by back-projecting the rendering depth map and the sample depth image, and the difference between the three-dimensional points in the first point cloud and those in the second point cloud is combined with the normal direction of each point, so that the determined deviation characterization value is more accurate.
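The following sketch illustrates one plausible reading of the deviation characterization value, in which the "product" of the target pose difference and the normal direction is taken as a point-to-plane inner product; this interpretation, and the pinhole back-projection, are assumptions for illustration.

```python
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project a depth map of shape (H, W) to camera-frame points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)

def deviation_values(rendered_depth, sample_depth, normal_map, K, object_mask):
    p1 = backproject(rendered_depth, K)[object_mask]  # first point cloud (rendered)
    p2 = backproject(sample_depth, K)[object_mask]    # second point cloud (measured)
    n = normal_map[object_mask]                       # normal direction per object pixel
    # Deviation characterization value per object pixel:
    # (first 3D point - second 3D point) . normal direction
    return np.einsum('ij,ij->i', p1 - p2, n)
```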
In some disclosed embodiments, before the network parameters in the pose estimation model are adjusted based on the difference between the optimized pose and the sample initial pose, the parameter adjustment module 54 is further configured to: determine whether the optimized pose is a preset erroneous estimation pose; and in response to the optimized pose not being the preset erroneous estimation pose, perform the step of adjusting the network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
According to the scheme, the network parameters in the pose estimation model are adjusted by using the difference between the optimized pose and the sample initial pose only when the optimized pose is not the preset erroneous estimation pose, which makes the adjustment more reasonable and reduces the disturbance caused by erroneous estimates to the pose estimation model.
In some disclosed embodiments, the parameter adjustment module 54 determining whether the optimized pose is the preset erroneous estimation pose includes: acquiring a central tendency characterization value over the deviation characterization values corresponding to the object pixel points, where the object pixel points are pixel points belonging to the object to be positioned in the sample color image, the deviation characterization value corresponding to an object pixel point is the product of the target pose difference corresponding to the object pixel point and the normal direction corresponding to the object pixel point, the target pose difference is the pose difference between the first three-dimensional point corresponding to the object pixel point and the corresponding second three-dimensional point, the first three-dimensional point is a three-dimensional point in the first point cloud corresponding to the rendering depth map, and the second three-dimensional point is a three-dimensional point in the second point cloud corresponding to the sample depth image; determining whether the central tendency characterization value is less than or equal to a preset size, where the preset size is related to the size of the object to be positioned in the physical world; and in response to the central tendency characterization value being less than or equal to the preset size, determining that the optimized pose is not the preset erroneous estimation pose.
According to the scheme, when the central tendency characterization value is not greater than the preset size, the optimized pose is not regarded as the preset erroneous estimation pose, so that optimized poses can be filtered based on the physical size of the object to be positioned.
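This filter is sketched below under two explicit assumptions: the central tendency characterization value is taken as the median of the absolute deviation characterization values, and the preset size is supplied by the caller (e.g., derived from the object's physical diameter). Neither choice is mandated by the disclosure.

```python
import numpy as np

def is_valid_estimate(deviations: np.ndarray, preset_size: float) -> bool:
    # Central tendency characterization value over the per-pixel deviations.
    central_tendency = np.median(np.abs(deviations))
    # Poses whose central tendency exceeds the preset size are treated as
    # preset erroneous estimation poses and excluded from parameter adjustment.
    return central_tendency <= preset_size
```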
In some disclosed embodiments, the sample pose estimation module 52 processing the sample color image by using the pose estimation model to obtain the sample initial pose of the object to be positioned includes: determining projection positions of a number of three-dimensional key points of the object to be positioned on the sample color image by using the pose estimation model; and determining the sample initial pose of the object to be positioned based on the projection position of each three-dimensional key point on the sample color image and the intrinsic parameters of the target camera.
According to the scheme, the projection positions of the three-dimensional key points of the object to be positioned on the sample color image can be determined by the pose estimation model, so that the sample initial pose of the object to be positioned is obtained from the determined projection positions of the three-dimensional key points and the intrinsic parameters of the target camera.
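Recovering a pose from 2D projections of known 3D key points is a classical PnP problem; the sketch below uses OpenCV's `solvePnP` as one possible solver. The choice of the EPnP flag and the absence of distortion coefficients are assumptions of this illustration, not requirements of the disclosure.

```python
import cv2
import numpy as np

def pose_from_keypoints(kp_3d: np.ndarray,   # (N, 3) key points in the object frame
                        kp_2d: np.ndarray,   # (N, 2) projection positions on the image
                        K: np.ndarray):      # (3, 3) camera intrinsic matrix
    ok, rvec, tvec = cv2.solvePnP(
        kp_3d.astype(np.float64), kp_2d.astype(np.float64),
        K.astype(np.float64), None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)               # rotation vector -> 3x3 rotation matrix
    return ok, R, tvec                       # sample initial pose as (R, t)
```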
In some disclosed embodiments, the sample pose estimation module 52 determining the projection positions of a number of three-dimensional key points of the object to be positioned on the sample color image by using the pose estimation model includes: predicting a direction vector from each object pixel point to each projection position by using the pose estimation model, where the object pixel points are pixel points belonging to the object to be positioned in the sample color image; for each projection position, determining a preset number of direction vectors from the plurality of direction vectors corresponding to the projection position, and generating a candidate projection position corresponding to each of these direction vectors; determining a score for each candidate projection position based on the positional relationship among the candidate projection positions; and taking the candidate projection position whose score meets a preset requirement as the projection position.
According to the scheme, a plurality of candidate projection positions are determined based on the direction vectors of the pixel points relative to the projection position, and the candidate projection position that meets the requirement is then selected as the final projection position, so that the determined projection position is more accurate.
In some disclosed embodiments, the sample pose estimation module 52 determining a preset number of direction vectors from the plurality of direction vectors corresponding to a projection position and generating the candidate projection position corresponding to each direction vector includes: summing the position of each object pixel point and the direction vector corresponding to that object pixel point to obtain the candidate projection position corresponding to each object pixel point. Determining the score of each candidate projection position based on the positional relationship among the candidate projection positions includes: for each candidate projection position, counting the number of target distances between this candidate projection position and the other candidate projection positions, and taking that number as the score, where a target distance is a distance less than or equal to a preset distance. Taking the candidate projection position whose score meets the preset requirement as the projection position includes: taking the candidate projection position corresponding to the maximum score as the projection position.
According to the scheme, the final projection position is determined from the distances between the candidate projection positions, so that the determined projection position is more accurate.
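The voting scheme of the two preceding passages can be condensed into the following sketch: each object pixel casts one candidate (its position plus its predicted direction vector), each candidate is scored by how many other candidates fall within the preset distance, and the maximum-score candidate is returned. Treating the predicted vectors as full 2D offsets (rather than unit directions) is an assumption of this illustration.

```python
import numpy as np

def locate_keypoint(pixel_positions: np.ndarray,    # (M, 2) object pixel coordinates
                    direction_vectors: np.ndarray,  # (M, 2) predicted vectors per pixel
                    preset_distance: float) -> np.ndarray:
    # Candidate projection position = pixel position + its direction vector.
    candidates = pixel_positions + direction_vectors
    # Score = number of other candidates within the preset distance.
    diff = candidates[:, None, :] - candidates[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    scores = (dist <= preset_distance).sum(axis=1) - 1  # exclude self
    # The candidate with the maximum score is taken as the projection position.
    return candidates[np.argmax(scores)]
```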
In some disclosed embodiments, the sample pose estimation module 52 processing the sample color image by using the pose estimation model to obtain the sample initial pose of the object to be positioned includes: performing target detection on the sample color image by using the pose estimation model to obtain the position of the object to be positioned; cropping the sample color image based on the position of the object to be positioned to obtain a local image containing the object to be positioned; and processing the local image to obtain the sample initial pose of the object to be positioned.
According to the scheme, target detection is performed on the sample color image to obtain the position of the object to be positioned, the sample color image is then cropped to obtain a local image containing the object to be positioned, and the local image is processed to obtain the sample initial pose of the object to be positioned, which reduces background interference and improves the accuracy of the sample initial pose.
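A small sketch of the detect-then-crop step follows, assuming the detection branch returns an axis-aligned box (x0, y0, x1, y1); the margin and all names are illustrative only.

```python
import numpy as np

def crop_object(color_image: np.ndarray, box, margin: int = 16) -> np.ndarray:
    """Cut out a local image containing the object to be positioned."""
    h, w = color_image.shape[:2]
    x0, y0, x1, y1 = box
    x0, y0 = max(0, int(x0) - margin), max(0, int(y0) - margin)
    x1, y1 = min(w, int(x1) + margin), min(h, int(y1) + margin)
    return color_image[y0:y1, x0:x1]
```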
Referring to fig. 6, fig. 6 is a schematic structural diagram of an embodiment of the pose estimation apparatus according to the present application. The pose estimation apparatus 60 includes a target image acquisition module 61, a target pose estimation module 62, and a target pose optimization module 63. The target image acquisition module 61 is configured to obtain a target image containing an object to be positioned, where the target image includes a target color image and a target depth image corresponding to the target color image; the target pose estimation module 62 is configured to process the target color image by using a pose estimation model to obtain a target initial pose of the object to be positioned; and the target pose optimization module 63 is configured to optimize the target initial pose based on the depth information of the object to be positioned in the target depth image to obtain the target pose of the object to be positioned. The pose estimation model is obtained by training with the training apparatus provided in the embodiment of the training apparatus for the pose estimation model.
According to the scheme, after the target color image is processed by using the pose estimation model to obtain the target initial pose of the object to be positioned, the target initial pose is optimized by using the target depth image, so that the resulting target pose of the object to be positioned is more accurate.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present application. The electronic device 70 includes a memory 71 and a processor 72 coupled to each other, and the processor 72 is configured to execute program instructions stored in the memory 71 to implement the steps of any of the above pose estimation model training method embodiments, or the steps of any of the above pose estimation method embodiments. In one specific implementation scenario, the electronic device 70 may include, but is not limited to, a microcomputer or a server; the electronic device 70 may also be a mobile device such as a notebook computer or a tablet computer, which is not limited here.
Specifically, the processor 72 is configured to control itself and the memory 71 to implement the steps of any of the above pose estimation model training method embodiments, or the steps of any of the above pose estimation method embodiments. The processor 72 may also be referred to as a CPU (Central Processing Unit) and may be an integrated circuit chip having signal processing capabilities. The processor 72 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 72 may be implemented jointly by a plurality of integrated circuit chips.
According to the scheme, after the sample color image is processed by using the pose estimation model to obtain the sample initial pose of the object to be positioned, the sample initial pose is optimized by using the sample depth image, so that the optimized pose of the object to be positioned is more accurate. Moreover, by using the difference between the optimized pose and the sample initial pose to adjust the network parameters in the pose estimation model, the sample color image does not need to be labeled, which reduces the labeling workload and improves the training efficiency of the pose estimation model.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present application. The computer-readable storage medium 80 stores program instructions 801 executable by a processor, and the program instructions 801 are used to implement the steps of any of the above pose estimation model training method embodiments, or the steps of any of the above pose estimation method embodiments.
According to the scheme, after the sample color image is processed by using the pose estimation model to obtain the sample initial pose of the object to be positioned, the sample initial pose is optimized by using the sample depth image, so that the optimized pose of the object to be positioned is more accurate. Moreover, by using the difference between the optimized pose and the sample initial pose to adjust the network parameters in the pose estimation model, the sample color image does not need to be labeled, which reduces the labeling workload and improves the training efficiency of the pose estimation model.
In some embodiments, the functions of, or the modules included in, the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for specific implementations, reference may be made to the descriptions of the above method embodiments, which are not repeated here for brevity.
The foregoing description of the various embodiments is intended to highlight the differences between the embodiments; for the same or similar parts, reference may be made to each other, and they are not repeated here for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatuses may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is merely a logical function division, and other divisions are possible in actual implementation; for instance, units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
If the technical solution of the present application involves personal information, a product applying the technical solution of the present application clearly informs the user of the personal information processing rules and obtains the individual's independent consent before processing the personal information. If the technical solution of the present application involves sensitive personal information, a product applying the technical solution obtains the individual's separate consent before processing the sensitive personal information and, at the same time, meets the requirement of "express consent". For example, at a personal information collection device such as a camera, a clear and prominent sign is set up to inform people that they are entering the personal information collection range and that personal information will be collected; if a person voluntarily enters the collection range, it is regarded as consent to the collection of their personal information. Alternatively, on a device that processes personal information, personal authorization is obtained, with the personal information processing rules communicated by a prominent sign or message, by means such as a pop-up window or by asking the person to upload their personal information themselves. The personal information processing rules may include information such as the personal information processor, the purpose of the personal information processing, the processing method, and the types of personal information to be processed.

Claims (14)

1. A method for training a pose estimation model, characterized by comprising the following steps:
obtaining a sample image containing an object to be positioned, wherein the sample image contains a sample color image and a sample depth image corresponding to the sample color image;
processing the sample color image by using a pose estimation model to obtain a sample initial pose of the object to be positioned;
optimizing the initial pose of the sample based on the depth information of the object to be positioned in the sample depth image to obtain the optimized pose of the object to be positioned;
adjusting network parameters in the pose estimation model based on a difference between the optimized pose and the sample initial pose.
2. The method of claim 1, wherein the optimizing the sample initial pose based on the depth information of the object to be positioned in the sample depth image to obtain an optimized pose of the object to be positioned comprises:
determining a rendering depth map of the object to be positioned based on the sample initial pose and a preset three-dimensional model corresponding to the object to be positioned;
determining an optimization term using a difference between the rendered depth map and the sample depth image;
and adjusting the initial pose of the sample to enable the optimization item to meet a preset requirement, and taking the adjusted initial pose of the sample as the optimization pose.
3. The method of claim 2, wherein the preset requirement is that the optimization term is minimized; and/or,
the method further comprises the following steps: determining a normal map of the object to be positioned based on the sample initial pose and the preset three-dimensional model; and determining an optimization term using a difference between the rendered depth map and the sample depth image, comprising:
respectively carrying out back projection on the rendering depth map and the sample depth image to obtain a first point cloud corresponding to the rendering depth map and a second point cloud corresponding to the sample depth image, wherein the first point cloud comprises first three-dimensional points corresponding to a plurality of object pixel points, the second point cloud comprises second three-dimensional points corresponding to each object pixel point, and the object pixel points are pixel points belonging to the object to be positioned in the sample color image;
for each object pixel point, determining a deviation characterization value corresponding to the object pixel point, wherein the deviation characterization value is the product of a target pose difference corresponding to the object pixel point and a normal direction corresponding to the object pixel point in the normal map, and the target pose difference is a pose difference between a first three-dimensional point corresponding to the object pixel point and a corresponding second three-dimensional point;
and determining the optimization items by combining the corresponding deviation characterization values of the object pixel points.
4. The method of claim 2 or 3, wherein before adjusting network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose, the method further comprises:
judging whether the optimized pose is a preset erroneous estimation pose;
in response to the optimized pose not being the preset erroneous estimation pose, performing the step of adjusting network parameters in the pose estimation model based on a difference between the optimized pose and the sample initial pose.
5. The method of claim 4, wherein the judging whether the optimized pose is the preset erroneous estimation pose comprises:
acquiring a central tendency characterization value between deviation characterization values corresponding to each object pixel point, wherein the object pixel point is a pixel point belonging to the object to be positioned in the sample color image, the deviation characterization value corresponding to the object pixel point is a product of a target pose difference corresponding to the object pixel point and a normal direction corresponding to the object pixel point, the target pose difference is a pose difference between a first three-dimensional point corresponding to the object pixel point and a corresponding second three-dimensional point, the first three-dimensional point is a three-dimensional point in a first point cloud corresponding to the rendered depth map, and the second three-dimensional point is a three-dimensional point in a second point cloud corresponding to the sample depth image;
judging whether the central tendency characterization value is less than or equal to a preset size, wherein the preset size is related to the size of the object to be positioned in the physical world;
determining that the optimized pose is not the preset erroneous estimation pose in response to the central tendency characterization value being less than or equal to the preset size.
6. The method according to any one of claims 1 to 5, wherein the processing the sample color image by using a pose estimation model to obtain a sample initial pose of the object to be positioned comprises:
determining projection positions of a plurality of three-dimensional key points of the object to be positioned on the sample color image by using the pose estimation model;
and determining the sample initial pose of the object to be positioned based on the projection position of each three-dimensional key point on the sample color image and intrinsic parameters of a target camera.
7. The method according to claim 6, wherein the determining, by using the pose estimation model, the projection positions of the plurality of three-dimensional key points of the object to be positioned on the sample color image comprises:
predicting a direction vector from each object pixel point to each projection position by using the pose estimation model, wherein the object pixel points are pixel points belonging to the object to be positioned in the sample color image;
for each projection position, determining a preset number of direction vectors from a plurality of direction vectors corresponding to the projection position, and generating candidate projection positions corresponding to the direction vectors;
determining a score of each candidate projection position based on a position relation between the candidate projection positions;
and taking the candidate projection position with the score meeting the preset requirement as the projection position.
8. The method of claim 7, wherein determining a preset number of direction vectors from a plurality of direction vectors corresponding to the projection positions, and generating candidate projection positions corresponding to each of the direction vectors comprises:
summing the position of each object pixel point and the direction vector corresponding to the object pixel point to obtain a candidate projection position corresponding to each object pixel point;
determining a score of each candidate projection position based on a position relationship between the candidate projection positions, including:
for each candidate projection position, determining the number of target distances between the candidate projection position and other candidate projection positions, and taking the number of target distances as the score, wherein the target distance is a distance smaller than or equal to a preset distance;
the step of taking the candidate projection position with the score meeting the preset requirement as the projection position comprises:
and taking the candidate projection position corresponding to the maximum score as the projection position.
9. The method according to any one of claims 1-8, wherein the processing the sample color image with a pose estimation model to obtain a sample initial pose of the object to be positioned comprises:
performing target detection on the sample color image by using the pose estimation model to obtain the position of the object to be positioned;
based on the position of the object to be positioned, cutting the sample color image to obtain a local image containing the object to be positioned;
and processing the local image to obtain a sample initial pose of the object to be positioned.
10. A pose estimation method, comprising:
acquiring a target image containing an object to be positioned, wherein the target image comprises a target color image and a target depth image corresponding to the target color image;
processing the target color image by using a pose estimation model to obtain a target initial pose of the object to be positioned;
optimizing the initial pose of the target based on the depth information of the object to be positioned in the target depth image to obtain the target pose of the object to be positioned;
wherein the pose estimation model is obtained by training using the pose estimation model training method according to any one of claims 1 to 9.
11. A training device for a pose estimation model, comprising:
a sample image acquisition module, configured to obtain a sample image containing an object to be positioned, wherein the sample image contains a sample color image and a sample depth image corresponding to the sample color image;
the sample pose estimation module is used for processing the sample color image by using a pose estimation model to obtain a sample initial pose of the object to be positioned;
the sample pose optimization module is used for optimizing the initial pose of the sample based on the depth information of the object to be positioned in the sample depth image to obtain the optimized pose of the object to be positioned;
a parameter adjustment module to adjust network parameters in the pose estimation model based on a difference between the optimized pose and the sample initial pose.
12. A pose estimation apparatus, characterized by comprising:
the target image acquisition module is used for acquiring a target image containing an object to be positioned, wherein the target image comprises a target color image and a target depth image corresponding to the target color image;
the target pose estimation module is used for processing the target color image by using a pose estimation model to obtain a target initial pose of the object to be positioned;
the target position and pose optimization module is used for optimizing the initial position and pose of the target based on the depth information of the object to be positioned in the target depth image to obtain the target position and pose of the object to be positioned;
wherein the pose estimation model is obtained by training with the training device for the pose estimation model according to claim 11.
13. An electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the method for training a pose estimation model according to any one of claims 1 to 9 or to implement the pose estimation method according to claim 10.
14. A computer-readable storage medium having stored thereon program instructions, which when executed by a processor, implement the pose estimation model training method of any one of claims 1 to 9, or implement the pose estimation method of claim 10.