CN114241013B - Object anchoring method, anchoring system and storage medium - Google Patents

Object anchoring method, anchoring system and storage medium

Info

Publication number
CN114241013B
CN114241013B (Application CN202210173770.0A)
Authority
CN
China
Prior art keywords
pose
model
neural network
training
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210173770.0A
Other languages
Chinese (zh)
Other versions
CN114241013A (en)
Inventor
张旭
毛文涛
邓伯胜
于天慧
蔡宝军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yingchuang Information Technology Co ltd
Original Assignee
Beijing Yingchuang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yingchuang Information Technology Co ltd filed Critical Beijing Yingchuang Information Technology Co ltd
Priority to CN202210173770.0A priority Critical patent/CN114241013B/en
Publication of CN114241013A publication Critical patent/CN114241013A/en
Application granted granted Critical
Publication of CN114241013B publication Critical patent/CN114241013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides an object anchoring method, an anchoring system and a storage medium. The object anchoring method comprises the following steps: training on an acquired image sequence containing an object of interest to obtain a three-dimensional model of the object of interest and a six-degree-of-freedom pose estimation neural network model for estimating the pose of the object; and performing pose estimation on the object of interest according to the three-dimensional model and the six-degree-of-freedom pose estimation neural network model to obtain the pose of the object of interest, and superimposing virtual information on the object of interest according to the pose to realize rendering of the object of interest. The method and the device can address the inaccuracy of user-defined object recognition and 3D tracking and the strong influence of illumination and environment on the algorithm, thereby realizing information augmentation and display for user-defined objects on a mobile terminal, with the displayed information corresponding to the 3D position and attitude of the object.

Description

Object anchoring method, anchoring system and storage medium
Technical Field
The application belongs to the technical field of image recognition, and particularly relates to an object anchoring method, an anchoring system and a storage medium.
Background
Common deep learning algorithms for object recognition and 3D position and attitude tracking require a large amount of manually labeled data, and training on user-defined objects makes it difficult to guarantee accuracy under varied and complex illumination and environments. Prior-art methods based on feature engineering use features such as SIFT and SURF; although these features have some robustness to illumination and background, they remain sensitive to moderately complex illumination and backgrounds, and tracking is prone to failure. Many existing methods also require the user to supply an initial pose and an accurate 3D model, and cannot track objects for which no 3D model is available.
Disclosure of Invention
To overcome, at least to some extent, the problems in the related art, the present application provides an object anchoring method, an anchoring system, and a storage medium.
According to a first aspect of embodiments herein, there is provided a method of anchoring an object, comprising the steps of:
training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object and a six-degree-of-freedom pose estimation neural network model for estimating the pose of the object;
and performing pose estimation on the interested object according to the three-dimensional model of the interested object and the six-degree-of-freedom pose estimation neural network model for object pose estimation to obtain the pose of the interested object, and superposing virtual information on the interested object according to the pose to realize the rendering of the interested object.
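For orientation only, the two steps above can be read as an offline phase (model and network training) followed by an online phase (per-frame pose estimation and rendering). The sketch below shows that control flow; every function name in it is an illustrative placeholder chosen for this description, not an interface defined by the application.

```python
# Minimal control-flow sketch of the two-phase anchoring method.
# All component functions are illustrative placeholders, not a real API.
from typing import Any, Iterable, Tuple

def build_object_model(images: Iterable[Any]) -> Any:
    """Offline: reconstruct a 3D model of the object of interest from an image sequence."""
    ...

def train_pose_network(object_model: Any, images: Iterable[Any]) -> Any:
    """Offline: train a six-degree-of-freedom pose estimation network for the object."""
    ...

def estimate_pose(pose_net: Any, object_model: Any, frame: Any) -> Tuple[Any, Any]:
    """Online: return (rotation, translation) of the object in the current frame."""
    ...

def render_overlay(frame: Any, pose: Tuple[Any, Any], virtual_content: Any) -> Any:
    """Online: superimpose virtual information on the frame at the estimated pose."""
    ...

def anchor(images, live_frames, virtual_content):
    model = build_object_model(images)             # step 1: 3D model of the object
    pose_net = train_pose_network(model, images)   # step 1: 6-DoF pose network
    for frame in live_frames:                      # step 2: per-frame pose + rendering
        pose = estimate_pose(pose_net, model, frame)
        yield render_overlay(frame, pose, virtual_content)
```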
In the above object anchoring method, the modeling is completed based on deep learning or computer vision in the process of obtaining the three-dimensional model of the object of interest through training according to the obtained image sequence containing the object of interest.
Further, the process of completing modeling based on deep learning is as follows:
extracting the characteristics of each frame of image, and estimating the camera initialization pose corresponding to each frame of image;
acquiring a mask of each frame of image by utilizing a pre-trained significance segmentation network;
model training and inference are performed to obtain a mesh of the model.
Further, the process of performing model training and inference is as follows:
randomly acquiring K pixel points on the image, each pixel point having a position coordinate, and converting the position coordinate of each pixel point into an imaging plane coordinate by using the camera intrinsic parameters;
inputting the imaging plane coordinates and the optimized camera pose into a neural network to extract the inter-frame color difference feature, and adding the color difference to the original image so as to compensate the color difference between frames;
inputting the camera initialization pose corresponding to the image into a neural network to obtain the optimized pose, and obtaining the optimized camera initial position from the optimized pose, wherein T is the function that takes the position coordinates;
emitting a ray from the optimized camera initial position, the direction w of the ray passing through the position coordinate of the pixel point;
sampling M points along the direction w, and using a deep learning network to predict the probability that each of the M points lies on the surface of the implicit equation (i.e., the implicit function TSDF); a point is judged to be on the surface when its predicted value is within a threshold, the surface point being taken as the sample with the smallest index that satisfies this condition;
sending the points predicted to be on the surface of the implicit equation into the neural renderer R to obtain the predicted RGB color values;
calculating, from the predicted RGB values and the colors of the K acquired pixel points, the squared loss of the pixel difference, wherein the squared loss L is a coefficient-weighted sum of the image pixel difference, the sum of the background mask difference and the foreground mask difference, and the edge difference; the image pixel difference is taken over all selected k points, the background mask difference is taken over the selected points lying outside the mask, the foreground mask difference uses the binary cross-entropy loss (BCE) over the selected points lying inside the mask, and the edge difference is taken over the boundary of the mask;
when the model performs inference, 3D points are input into the combined model of the above neural networks and the deep learning network; the combined model is used to obtain the points lying on its surface, and the mesh is formed from these points.
Further, the process of completing the modeling based on the computer vision is as follows:
performing feature extraction and matching by adopting a visual algorithm or a deep learning algorithm;
estimating the pose of the camera;
segmenting salient objects in the image sequence;
reconstructing the dense point cloud;
using the reconstructed dense point cloud as the input of grid generation, and reconstructing the grid of the object by using a reconstruction algorithm;
finding out texture coordinates corresponding to the grid vertex according to the camera pose and the image corresponding to the camera pose to obtain a mapping of the grid;
and obtaining a three-dimensional model according to the grids of the object and the mapping of the grids.
In the object anchoring method, the specific process of training the six-degree-of-freedom pose estimation neural network model for object pose estimation according to the acquired image sequence containing the object of interest comprises the following steps:
obtaining a synthetic data set by adopting a PBR rendering method according to the three-dimensional model and the preset scene model of the object; the synthetic dataset includes synthetic training data;
obtaining a real data set by adopting a model reprojection segmentation algorithm according to the camera pose and the object pose; the real dataset comprises real training data;
and training the six-degree-of-freedom pose estimation neural network based on deep learning by utilizing the synthetic training data and the real training data to obtain a six-degree-of-freedom pose estimation neural network model.
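As an illustration of how the synthetic and real training data might be combined, the following sketch mixes the two sets in a single training loop using PyTorch's ConcatDataset; the dataset classes, the network and the loss function are placeholders and are not prescribed by the application.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

def train_pose_network(pose_net, synthetic_ds, real_ds, loss_fn, epochs=10, lr=1e-4):
    """Train the six-degree-of-freedom pose estimation network on synthetic + real data.

    synthetic_ds / real_ds: torch Datasets yielding (image, label) pairs whose labels
    carry category, position and 6-DoF pose annotations (placeholders here).
    """
    data = ConcatDataset([synthetic_ds, real_ds])           # mix both data sources
    loader = DataLoader(data, batch_size=32, shuffle=True)
    opt = torch.optim.Adam(pose_net.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = loss_fn(pose_net(images), labels)        # weighted multi-term loss
            loss.backward()
            opt.step()
    return pose_net
```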
Further, the specific process of obtaining the synthetic data set by using the PBR rendering method according to the three-dimensional model and the preset scene model of the object is as follows:
reading a three-dimensional model and a preset scene model of an object;
randomizing the object pose, the rendering camera pose, the materials and the illumination with a PBR rendering method to obtain a series of image sequences and their corresponding annotation labels; the annotation labels include category, position and six-degree-of-freedom pose.
Further, the specific process of obtaining the real data set by using the model re-projection segmentation algorithm according to the camera pose and the object pose is as follows:
acquiring an image sequence, a camera pose and an object pose, and segmenting an object in a real image;
synthesizing the real data having discrete poses into data with denser, more continuous poses, thereby obtaining real images and their corresponding labels; the annotation labels include category, position and six-degree-of-freedom pose.
Furthermore, the specific process of training the six-degree-of-freedom pose estimation neural network based on deep learning by using the synthetic training data and the real training data to obtain the six-degree-of-freedom pose estimation neural network model is as follows:
inputting an image, the 2D coordinates of a plurality of feature points extracted from the object, the 3D coordinates corresponding to the feature points, and an image mask;
training the six-degree-of-freedom pose estimation neural network by adopting the following loss function to obtain a six-degree-of-freedom pose estimation neural network model;
the loss function needed when training the six-degree-of-freedom pose estimation neural network is a coefficient-weighted sum of a classification loss, a bounding box loss, a 2D loss, a 3D loss, a mask loss and a projection loss;
wherein the classification loss is computed from the classification information of the i-th detection anchor point, the information of the j-th background feature, the anchor points, the background anchor points, the category true values, and the features proposed by the neural network;
the bounding box loss is computed from the coordinate features of the i-th detection anchor point and the true values of the detection-box coordinates;
the 2D loss is computed from the 2D coordinate features, the true values of the 2D feature points of the object, and the feature points and masks predicted by the neural network;
the 3D loss is computed from the 3D coordinate features, the true values of the 3D feature points of the object, and the feature points and masks predicted by the neural network;
the mask loss is computed from the i-th foreground feature and the j-th background feature, where fg denotes the foreground and bg denotes the background;
the projection loss is the difference between the 3D features projected to 2D and the 2D true values, computed over the feature points and masks predicted by the neural network.
In the object anchoring method, the rendering of the interested object is realized through a mobile terminal or through the mixing of the mobile terminal and a cloud server;
the process realized by the mobile terminal is as follows:
before tracking is started, accessing a cloud server, downloading an object model, a deep learning model and a feature database of a user, and then performing other calculations on a mobile terminal;
the mobile terminal reads camera data from the equipment, and the object pose is obtained by detecting or identifying the neural network and estimating the neural network by the pose of six degrees of freedom;
rendering the content to be rendered according to the pose of the object;
the process of realizing the mixing of the mobile terminal and the cloud server is as follows:
inputting an image sequence in the mobile terminal, and performing significance detection on each frame of image;
uploading the significance detection area to a cloud server for retrieval to obtain information of the object and a deep learning model related to the information, and loading the information to the mobile terminal;
estimating the pose of the object at the mobile terminal to obtain the pose of the object;
and rendering the content to be rendered according to the pose of the object.
According to a second aspect of the embodiments of the present application, there is also provided an object anchoring system, which includes a cloud training unit and an object pose calculation and rendering unit;
the cloud training unit is used for training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object and a six-degree-of-freedom pose estimation neural network model for estimating the pose of the object;
the object pose calculation and rendering unit is used for estimating the pose of the interested object according to the three-dimensional model of the interested object and the six-degree-of-freedom pose estimation neural network model for estimating the pose of the interested object, and superposing virtual information on the interested object to realize the rendering of the interested object;
the cloud training unit comprises a modeling unit, a synthetic training data generating unit, a real training data generating unit and a training algorithm unit;
the modeling unit is used for training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object;
the synthetic training data generation unit is used for obtaining a synthetic data set according to a three-dimensional model of an object and a preset scene model, and the synthetic data set comprises synthetic training data;
the real training data generation unit is used for obtaining a real data set according to the camera pose and the object pose, and the real data set comprises real training data;
and the training algorithm unit is used for training the six-degree-of-freedom pose estimation neural network based on deep learning according to the synthetic training data and the real training data to obtain a six-degree-of-freedom pose estimation neural network model.
According to a third aspect of embodiments of the present application, there is also provided a storage medium having an executable program stored thereon, which when called, performs the steps in the object anchoring method described in any one of the above.
According to the above embodiments of the present application, at least the following advantages are obtained: in the object anchoring method, a model that performs recognition and 3D position and attitude tracking from 2D images is trained on a combination of synthetic data and synthesized real data, which can solve the problem that inaccuracy, illumination, environment and the like strongly affect the algorithm when users define their own objects for recognition and 3D tracking; this in turn enables information augmentation and display for user-defined objects on a mobile terminal, with the displayed information corresponding to the 3D position and attitude of the object.
By combining modeled and rendered synthetic data with automatically labeled real data, the object anchoring method avoids the heavy workload and slow speed of manual labeling, improves the efficiency and accuracy of model training, makes it feasible to track a deep learning model of a user-defined object, and allows tracking to be initialized automatically with low sensitivity to illumination, environment and the like.
By adopting an end-cloud combined architecture, the object anchoring method makes large-scale object recognition and 3D position and attitude tracking on mobile terminals possible.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the scope of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification of the application, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart of an object anchoring method according to an embodiment of the present disclosure.
Fig. 2 is a block diagram of an object anchoring system according to an embodiment of the present invention.
Fig. 3 is a block diagram of a structure of a cloud-end training unit in an object anchoring system according to an embodiment of the present disclosure.
Fig. 4 is a block diagram illustrating a structure of a deep learning-based modeling unit in an object anchoring system according to an embodiment of the present disclosure.
Fig. 5 is a schematic diagram of a modeling process of a modeling unit based on computer vision in an object anchoring system according to an embodiment of the present application.
Fig. 6 is a block diagram illustrating a structure of a synthesized training data generating unit in an object anchoring system according to an embodiment of the present disclosure.
Fig. 7 is a flowchart illustrating a processing of a PBR rendering unit in an object anchoring system according to an embodiment of the present disclosure.
Fig. 8 is a flowchart illustrating a process of a composite image reality migration unit in an object anchoring system according to an embodiment of the present application.
Fig. 9 is a block diagram illustrating a structure of a real training data generating unit in an object anchoring system according to an embodiment of the present disclosure.
Fig. 10 is a flowchart of an implementation of an object pose calculation and rendering unit in an object anchoring system by a mobile terminal according to an embodiment of the present disclosure.
Fig. 11 is a flowchart of an implementation of an object pose calculation and rendering unit in an object anchoring system by mixing a mobile terminal and a cloud server according to an embodiment of the present disclosure.
Description of reference numerals:
1. a cloud training unit;
11. a modeling unit;
12. a synthetic training data generating unit; 121. a PBR rendering unit; 122. a composite image reality migration unit;
13. a real training data generating unit; 131. a model reprojection segmentation algorithm unit; 132. an inter-frame data synthesis unit;
14. a training algorithm unit;
2. and an object pose calculating and rendering unit.
Detailed Description
For the purpose of promoting a clear understanding of the objects, aspects and advantages of the embodiments of the present application, reference will now be made to the accompanying drawings and detailed description, wherein like reference numerals refer to like elements throughout.
The illustrative embodiments and descriptions of the present application are provided to explain the present application and not to limit the present application. Additionally, the same or similar numbered elements/components used in the drawings and the embodiments are used to represent the same or similar parts.
As used herein, "first," "second," …, etc., are not specifically intended to mean in a sequential or chronological order, nor are they intended to limit the application, but merely to distinguish between elements or operations described in the same technical language.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.
As used herein, "and/or" includes any and all combinations of the described items.
References to "plurality" herein include "two" and "more than two"; reference to "multiple sets" herein includes "two sets" and "more than two sets".
Certain words used to describe the present application are discussed below or elsewhere in this specification to provide additional guidance to those skilled in the art in describing the present application.
As shown in fig. 1, an object anchoring method provided in an embodiment of the present application includes the following steps:
and S1, training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object and a six-degree-of-freedom pose estimation neural network model for estimating the pose of the object.
S2, performing pose estimation on the interested object according to the three-dimensional model of the interested object and the six-degree-of-freedom pose estimation neural network model for object pose estimation to obtain the pose of the interested object, and overlaying virtual information on the interested object according to the pose to realize the rendering of the interested object.
In step S1, in the process of obtaining the three-dimensional model of the object of interest by training on the acquired image sequence containing the object of interest, the modeling may be completed based on deep learning or based on computer vision.
When modeling is completed based on deep learning, the specific process is as follows:
S111, extracting features and initializing camera pose estimation;
features are extracted from each frame image, and the camera initialization pose corresponding to each frame image is estimated.
S112, segmenting the salient object;
the mask of each frame image is obtained by using a pre-trained saliency segmentation network.
S113, model training and inference;
the goal of model training is to obtain a mesh of the model.
On the image, K pixel points are randomly acquired, each pixel point having a position coordinate; the position coordinate of each pixel point is converted into an imaging plane coordinate by using the camera intrinsic parameters.
The imaging plane coordinates and the optimized camera pose are input into a neural network to extract the inter-frame color difference feature, and the color difference is added to the original image to compensate the color difference between frames; the inter-frame color difference feature is given by formula (1).
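As one way to picture the sampling step just described, the sketch below draws K random pixels from a frame and converts their pixel coordinates to imaging-plane coordinates with the camera intrinsic matrix; the pinhole model and all variable names are assumptions made for illustration.

```python
import numpy as np
from typing import Optional

def sample_pixels_to_plane(image: np.ndarray, intrinsics: np.ndarray, k: int = 1024,
                           rng: Optional[np.random.Generator] = None):
    """Draw k random pixels and map (u, v) pixel coordinates to imaging-plane coordinates.

    Assumes a standard pinhole model: [x, y, 1]^T ~ K^{-1} [u, v, 1]^T.
    Returns (plane_coords, colors, pixel_coords).
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    us = rng.integers(0, w, size=k)
    vs = rng.integers(0, h, size=k)
    pix_h = np.stack([us, vs, np.ones(k)]).astype(np.float64)   # 3 x k homogeneous pixels
    plane = np.linalg.inv(intrinsics) @ pix_h                   # normalized plane coordinates
    return plane[:2].T, image[vs, us], np.stack([us, vs], axis=1)
```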
The camera initialization pose corresponding to the image is input into a neural network to obtain a more accurate optimized pose. The optimized camera pose is characterized by three rotation angles, about the x axis, the y axis and the z axis respectively, together with the camera initial position; the optimized pose is given by formula (2).
From the optimized pose, the optimized camera initial position is obtained as in formula (3), where T is a function that takes the position coordinates.
From the optimized camera initial position, a ray is emitted whose direction w passes through the position coordinate of the pixel point; the direction w is given by formula (4).
M points are sampled along the direction w, each with its own 3D coordinate, and a deep learning network is used to predict the probability that each of the M points lies on the surface of the implicit equation (i.e., the implicit function TSDF).
The judgment condition for a point predicted to be on the surface of the implicit equation is given by formula (5): a point is judged to be on the surface when its predicted value is within a threshold, and the surface point is taken as the sample with the smallest index m that satisfies this condition. A point satisfying formula (5) can be predicted as a point on the surface of the implicit equation.
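The ray-marching step in formulas (4) and (5) can be pictured as below: sample M points along the ray from the camera position through the pixel, query the learned implicit function at each sample, and keep the first sample whose value falls within the threshold. The callable interface of the network and the near/far bounds are assumptions for illustration.

```python
import numpy as np

def surface_point_along_ray(origin, direction, implicit_fn, m=64, near=0.1, far=2.0, eps=1e-3):
    """Sample m points along origin + t * direction and keep the first near-surface sample.

    implicit_fn: callable mapping an (m, 3) array of 3D points to (m,) implicit values
                 (stands in for the deep learning network predicting the TSDF).
    Returns the selected 3D point, or None if no sample passes the threshold test.
    """
    origin = np.asarray(origin, dtype=np.float64)
    direction = np.asarray(direction, dtype=np.float64)
    ts = np.linspace(near, far, m)
    pts = origin[None, :] + ts[:, None] * direction[None, :]   # (m, 3) samples along the ray
    vals = np.asarray(implicit_fn(pts))                        # predicted implicit values
    hits = np.flatnonzero(np.abs(vals) < eps)                  # samples judged to be on the surface
    return pts[hits[0]] if hits.size else None                 # smallest index satisfying the test
```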
The points predicted to be on the surface of the implicit equation are sent into the neural renderer R to obtain the predicted RGB color values, as in formula (6).
From the predicted RGB values and the colors of the K acquired pixel points, the squared loss of the pixel difference is calculated, so that the shape of the mesh becomes closer to the mesh of the object in the image.
The squared loss L of the pixel difference, formula (7), is a coefficient-weighted sum of the image pixel difference, the sum of the background mask difference and the foreground mask difference, and the edge difference; the three coefficients may be taken as 1, 0.5 and 1 respectively.
In formula (7), the image pixel difference is given by formula (8), where P denotes all of the selected k points.
The background mask difference is given by formula (9), taken over the selected points lying outside the mask. The physical meaning of formula (9) is that, for points not on the object, the estimated background mask value should be as close to 0 as possible.
The foreground mask difference is given by formula (10). The physical meaning of formula (10) is that, for points on the object, the estimated foreground mask value should be as close to 1 as possible. In formulas (9) and (10), BCE denotes the binary cross-entropy loss, and the foreground term is taken over the selected points lying inside the mask.
The edge difference in formula (7) is given by formula (11), taken over the boundary of the mask; formula (11) enhances the loss on the edge points so as to increase their weight.
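The following PyTorch sketch mirrors the structure of formulas (7) to (11): a squared color term over the sampled pixels, binary cross-entropy terms pushing the predicted mask toward 0 outside the object and toward 1 inside it, and an extra term that up-weights the mask boundary. Tensor shapes and the exact per-term formulations are assumptions; only the weighted-sum structure and the stated coefficients (1, 0.5, 1) come from the description above.

```python
import torch
import torch.nn.functional as F

def render_loss(pred_rgb, gt_rgb, pred_mask, on_object, edge, w=(1.0, 0.5, 1.0)):
    """Sketch of the training loss in formulas (7)-(11), under assumed tensor shapes.

    pred_rgb, gt_rgb : (k, 3) predicted and observed colors of the sampled pixels
    pred_mask        : (k,) predicted foreground probability per sampled pixel
    on_object        : (k,) bool, True for samples inside the saliency mask
    edge             : (k,) bool, True for samples on the mask boundary
    Assumes each of the three index subsets is non-empty.
    """
    l_rgb = F.mse_loss(pred_rgb, gt_rgb)                                      # image pixel term
    target = on_object.float()                                                # 1 inside, 0 outside
    l_fg = F.binary_cross_entropy(pred_mask[on_object], target[on_object])    # push toward 1
    l_bg = F.binary_cross_entropy(pred_mask[~on_object], target[~on_object])  # push toward 0
    l_edge = F.binary_cross_entropy(pred_mask[edge], target[edge])            # up-weight the boundary
    return w[0] * l_rgb + w[1] * (l_bg + l_fg) + w[2] * l_edge
```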
When the model performs inference, 3D points are input into the combined model of the color-compensation neural network, the surface-prediction deep learning network and the pose-optimization neural network; the combined model is used to obtain the points lying on its surface, and the mesh is formed from these points.
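One common way to realize this inference step is to evaluate the learned implicit function on a regular 3D grid and run marching cubes over the resulting volume to form the mesh; the sketch below assumes scikit-image is available and is not the application's prescribed procedure.

```python
import numpy as np
from skimage.measure import marching_cubes   # assumes scikit-image is installed

def extract_mesh(implicit_fn, bbox_min, bbox_max, resolution=128, level=0.0):
    """Evaluate the learned implicit function on a grid and extract a surface mesh.

    implicit_fn: callable mapping (n, 3) points to (n,) implicit values (e.g. a TSDF).
    """
    xs, ys, zs = [np.linspace(lo, hi, resolution) for lo, hi in zip(bbox_min, bbox_max)]
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1).reshape(-1, 3)
    values = np.asarray(implicit_fn(grid)).reshape(resolution, resolution, resolution)
    spacing = tuple((hi - lo) / (resolution - 1) for lo, hi in zip(bbox_min, bbox_max))
    verts, faces, normals, _ = marching_cubes(values, level=level, spacing=spacing)
    return verts + np.asarray(bbox_min), faces
```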
When modeling is completed based on computer vision, the specific process is as follows:
S121, extracting and matching features by adopting a visual algorithm or a deep learning algorithm;
and extracting features from the input image sequence, matching the features, and taking the matched features as input of camera pose estimation.
The input image sequence may be a color image or a grayscale image. The algorithm for extracting and matching the features can be SIFT, HAAR, ORB and other traditional visual algorithms, and can also be a deep learning algorithm.
S122, estimating the pose of the camera;
and taking the matched features as observed quantities, and estimating the pose of the camera by using an SFM (structure-from-motion algorithm, which is an off-line algorithm for three-dimensional reconstruction based on various collected disordered pictures).
S123, segmenting the salient objects in the image sequence;
and taking the camera pose as a priori, and segmenting the salient objects in the image sequence by using a salient object segmentation algorithm to serve as the input of point cloud reconstruction.
S124, reconstructing the dense point cloud;
and generating a 3D point cloud of the feature points according to the camera pose and the feature points, and obtaining dense point cloud by using a block matching algorithm.
And S125, using the reconstructed dense point cloud as the input of grid generation, and reconstructing the grid of the object by using Poisson or another reconstruction algorithm.
And S126, finding texture coordinates corresponding to the grid vertex according to the camera pose and the image corresponding to the camera pose, and obtaining a grid map.
And S127, obtaining a three-dimensional model according to the grids of the object and the mapping of the grids.
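Steps S121 and S122 can be illustrated with a minimal two-view version using OpenCV: SIFT features are extracted and matched, and the relative camera pose is recovered from the essential matrix. A full pipeline would run incremental SfM over the whole sequence; this sketch covers only the two-frame core and assumes an OpenCV build that includes SIFT.

```python
import cv2
import numpy as np

def two_view_pose(img1, img2, K):
    """Match SIFT features between two frames and recover the relative camera pose."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # Lowe's ratio test
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return R, t, pts1, pts2
```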
In the step S1, the specific process of training the six-degree-of-freedom pose estimation neural network model for object pose estimation according to the acquired image sequence including the object of interest includes:
and obtaining a synthetic data set by adopting a PBR rendering method according to the three-dimensional model and the preset scene model of the object. Wherein the synthetic data set includes synthetic training data.
And obtaining a real data set by adopting a model reprojection segmentation algorithm according to the camera pose and the object pose. Wherein the real dataset comprises real training data.
And training the six-degree-of-freedom pose estimation neural network based on deep learning by utilizing the synthetic training data and the real training data to obtain a six-degree-of-freedom pose estimation neural network model.
In a specific embodiment, according to the three-dimensional model and the preset scene model of the object, the specific process of obtaining the synthetic data set by using the PBR rendering method includes:
reading a three-dimensional model and a preset scene model of an object;
and (4) carrying out object pose randomization, rendering camera pose randomization, material randomization and illumination randomization by adopting a PBR rendering method to obtain a series of image sequences and corresponding labeling labels. The label can be a category, a position, a pose with six degrees of freedom, and the like.
The specific process of obtaining the synthetic data set by adopting the PBR rendering method according to the three-dimensional model and the preset scene model of the object further comprises the following steps:
reading a three-dimensional model or a real image or a PBR image, and performing preprocessing such as background removal on the image; synthetic images at different angles and their corresponding annotation labels are then generated through a deep learning network such as a GAN (Generative Adversarial Network) or NeRF (Neural Radiance Fields). The label can be a category, a position, a pose with six degrees of freedom, and the like.
In a specific embodiment, the specific process of obtaining the real data set by using the model reprojection segmentation algorithm according to the camera pose and the object pose is as follows:
acquiring an image sequence, a camera pose and an object pose, and segmenting an object in a real image;
and synthesizing the real data with the discrete poses into data with more dense and continuous poses, and further obtaining a real image and a corresponding label thereof. The label can be a category, a position, a pose with six degrees of freedom, and the like.
In a specific embodiment, the specific process of training the six-degree-of-freedom pose estimation neural network based on deep learning by using the synthetic training data and the real training data to obtain the six-degree-of-freedom pose estimation neural network model is as follows:
the method comprises the steps of inputting an image, 2D coordinates of a plurality of characteristic points extracted from an object, 3D coordinates corresponding to the characteristic points and an image mask.
And training the six-degree-of-freedom pose estimation neural network by adopting the following loss function to obtain a six-degree-of-freedom pose estimation neural network model.
The loss function needed when training the six-degree-of-freedom pose estimation neural network is as follows:
The loss, formula (12), is a coefficient-weighted sum of a classification loss, a bounding box loss, a 2D loss, a 3D loss, a mask loss and a projection loss.
Specifically, the classification loss is given by formula (13); in formula (13), the terms are the classification information of the i-th detection anchor point, the information of the j-th background feature, the anchor points, the background anchor points, the category true values, and the features proposed by the neural network.
The bounding box loss is given by formula (14); in formula (14), the terms are the coordinate features of the i-th detection anchor point and the true values of the detection-box coordinates.
The 2D loss is given by formula (15); in formula (15), the terms are the 2D coordinate features, the true values of the 2D feature points of the object, and the feature points and masks predicted by the neural network.
The 3D loss is given by formula (16); in formula (16), the terms are the 3D coordinate features, the true values of the 3D feature points of the object, and the feature points and masks predicted by the neural network.
The mask loss is given by formula (17); in formula (17), the terms are the i-th foreground feature and the j-th background feature, where fg denotes the foreground and bg denotes the background.
The projection loss is given by formula (18); in formula (18), the 3D features are projected to 2D and differenced with the 2D true values, over the feature points and masks predicted by the neural network.
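The weighted-sum structure of formula (12) can be sketched in PyTorch as below. The individual terms use common stand-ins (cross-entropy for classification, smooth L1 for boxes and keypoints, binary cross-entropy for the mask, and an L1-style 2D/3D reprojection term); they are assumptions for illustration rather than the exact formulas (13) to (18).

```python
import torch
import torch.nn.functional as F

def pose_training_loss(pred, target, w=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the six loss terms of formula (12); each term is a common stand-in.

    pred / target are dicts with keys: 'cls' (logits / class ids), 'box', 'kp2d', 'kp3d',
    'mask' (logits / {0,1} mask), plus pred['kp3d_proj'] for the 3D points projected to 2D.
    """
    l_cls = F.cross_entropy(pred["cls"], target["cls"])                        # classification
    l_box = F.smooth_l1_loss(pred["box"], target["box"])                       # bounding box
    l_2d = F.smooth_l1_loss(pred["kp2d"], target["kp2d"])                      # 2D keypoints
    l_3d = F.smooth_l1_loss(pred["kp3d"], target["kp3d"])                      # 3D keypoints
    l_mask = F.binary_cross_entropy_with_logits(pred["mask"], target["mask"])  # fg/bg mask
    l_proj = F.smooth_l1_loss(pred["kp3d_proj"], target["kp2d"])               # 2D/3D consistency
    terms = (l_cls, l_box, l_2d, l_3d, l_mask, l_proj)
    return sum(wi * li for wi, li in zip(w, terms))
```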
In the step S2, the pose calculation and rendering of the object of interest may be implemented by the mobile terminal, or may be implemented by mixing the mobile terminal and the cloud server.
The mode of realizing the pose calculation and the rendering of the interested object through the mobile terminal is suitable for the condition that the user-defined models are few. Before tracking is started, the cloud server is accessed only once, and after the object model, the deep learning model, the feature database and the like of the user are downloaded, other calculations are carried out on the mobile terminal. The mobile terminal reads camera data from the equipment, obtains the pose of an object by detecting or identifying the neural network and estimating the neural network by the pose of six degrees of freedom, and then renders the content to be rendered according to the pose.
The mode of realizing the pose calculation and rendering of the interested object by mixing the mobile terminal and the cloud server is suitable for the condition that more user-defined models are available, and is a general object pose tracking solution. In the tracking process, the cloud server needs to be accessed and resources downloaded one or more times. The mobile terminal inputs an image sequence and outputs an object pose and a rendered image.
The main flow of the mode is as follows: inputting an image sequence in the mobile terminal, performing significance detection on each frame of image, uploading a significance detection area to the cloud server for retrieval, obtaining object information and a depth learning model related to the object information, loading the object information and the depth learning model to the mobile terminal for pose estimation, then obtaining an object pose, and rendering content to be rendered according to the pose.
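The hybrid flow just described can be sketched as follows: detect a salient region on the device, query the cloud service once for the matching object information and model, then run pose estimation and rendering locally on every frame. The endpoint URL, payload format and all function names are hypothetical.

```python
import requests  # used only to illustrate the one-off cloud retrieval step

CLOUD_RETRIEVE_URL = "https://example.com/api/retrieve"   # hypothetical endpoint

def run_hybrid_anchoring(frames, saliency_fn, pose_fn, render_fn):
    """Sketch of the end-cloud hybrid loop: saliency on-device, retrieval in the cloud,
    pose estimation and rendering back on-device."""
    model_blob = None
    for frame in frames:
        region = saliency_fn(frame)                       # saliency detection per frame
        if region is None:
            yield frame
            continue
        if model_blob is None:                            # one (or few) cloud round-trips
            resp = requests.post(CLOUD_RETRIEVE_URL, files={"crop": region.tobytes()})
            model_blob = resp.content                     # object info + deep learning model
        pose = pose_fn(model_blob, frame)                 # 6-DoF pose on the mobile terminal
        yield render_fn(frame, pose)                      # overlay content at that pose
```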
The object anchoring method provided by the application adopts an unsupervised deep-learning modeling approach: only a small number of feature points are needed to compute the initial camera pose, after which modeling can proceed without requiring feature points on the object itself, so that even solid-colored or weakly textured objects can be modeled.
In the object anchoring method, a model that performs recognition and 3D position and attitude tracking from 2D images is trained on a combination of synthetic data and synthesized real data, which can solve the problem that inaccuracy, illumination, environment and the like strongly affect the algorithm when users define their own objects for recognition and 3D tracking; this in turn enables information augmentation and display for user-defined objects on a mobile terminal, with the displayed information corresponding to the 3D position and attitude of the object.
By combining modeled and rendered synthetic data with automatically labeled real data, the object anchoring method avoids the heavy workload and slow speed of manual labeling, improves the efficiency and accuracy of model training, makes it feasible to track a deep learning model of a user-defined object, and allows tracking to be initialized automatically with low sensitivity to illumination, environment and the like.
By adopting an end-cloud combined architecture, the object anchoring method makes large-scale object recognition and 3D position and attitude tracking on mobile terminals possible.
Based on the object anchoring method provided by the application, the application also provides an object anchoring system provided by the application.
Fig. 2 is a schematic structural diagram of an object anchoring system according to an embodiment of the present application.
As shown in fig. 2, the object anchoring system provided in the embodiment of the present application includes a cloud training unit 1 and an object pose calculation and rendering unit 2. The cloud training unit 1 is used for obtaining a three-dimensional model of an interested object and a six-degree-of-freedom pose estimation neural network model for object posture estimation through training according to an acquired image sequence containing the interested object. The object pose calculation and rendering unit 2 is configured to perform pose estimation on the object of interest according to the three-dimensional model of the object of interest and the six-degree-of-freedom pose estimation neural network model for object pose estimation, and superimpose virtual information on the object of interest to implement rendering of the object of interest.
In the present embodiment, as shown in fig. 3, the cloud training unit 1 includes a modeling unit 11, a synthetic training data generating unit 12, a real training data generating unit 13, and a training algorithm unit 14.
The modeling unit 11 is configured to train a three-dimensional model of the object of interest according to the acquired image sequence including the object of interest.
The synthetic training data generating unit 12 is configured to obtain a synthetic data set according to the three-dimensional model of the object and the preset scene model, where the synthetic data set includes synthetic training data.
The real training data generating unit 13 is configured to obtain a real data set according to the camera pose and the object pose, where the real data set includes real training data.
The training algorithm unit 14 is configured to train the pose estimation neural network based on six degrees of freedom for deep learning according to the synthetic training data and the real training data, so as to obtain a pose estimation neural network model based on six degrees of freedom.
In a specific embodiment, the modeling unit 11 comprises a deep learning based modeling unit and a computer vision based modeling unit.
As shown in fig. 4, the input of the deep learning based modeling unit is a sequence of images, the output of which is a deep learning model. And inputting the multiple images into the deep learning model for inference to obtain grids and textures.
The modeling process of the deep learning based modeling unit is the same as the content of the above steps S111-S113, and is not repeated here.
As shown in fig. 5, the input of the computer vision based modeling unit is a sequence of images, the output of which is a modeled stereo model.
The modeling process of the computer vision-based modeling unit is the same as that of the above steps S121 to S127, and is not repeated here.
In the above-described embodiment, as shown in fig. 6 and 7, the synthetic training data generating unit 12 includes a PBR (Physically Based Rendering) rendering unit. The PBR rendering unit 121 reads the stereo model of the object and the preset scene model using a rendering framework such as Blender or Unity, and performs object pose randomization, rendering camera pose randomization, material randomization and illumination randomization to obtain a series of image sequences and corresponding annotation labels. The label can be a category, a position, a pose with six degrees of freedom, and the like.
As shown in fig. 6 and 8, the synthetic training data generating unit 12 further includes a synthetic image reality migration unit 122. The synthetic image reality migration unit 122 reads the stereo model or a real image or a PBR image, performs preprocessing such as background removal on the image, and then generates synthetic images at different angles and their corresponding annotation labels through a deep learning network such as a GAN (Generative Adversarial Network) or NeRF (Neural Radiance Fields). The label can be a category, a position, a pose with six degrees of freedom, and the like.
In the above embodiment, as shown in fig. 9, the real training data generating unit 13 includes the model reprojection segmentation algorithm unit 131. The model re-projection segmentation algorithm unit 131 obtains the image sequence, the camera pose and the object pose, and segments the object in the real image.
The real training data generating unit 13 further includes an inter-frame data synthesizing unit 132, which is configured to synthesize the real data with discrete poses into data with more dense and continuous poses, so as to obtain a real image and its corresponding label. The label can be a category, a position, a pose with six degrees of freedom, and the like.
In the above embodiment, the training algorithm unit 14 trains the six-degree-of-freedom pose estimation neural network based on deep learning according to the synthetic training data and the real training data.
The six-degree-of-freedom pose estimation neural network is trained end to end, so that object detection and six-degree-of-freedom pose estimation can be completed by a single network. Its inputs are an image, the 2D coordinates of a plurality of feature points extracted from the object, the 3D coordinates corresponding to the feature points, and an image mask. The network structure is shown in fig. 9 of the drawings: a first-stage neural network outputs the detection box, and a second-stage neural network computes the 2D and 3D keypoints of the object. The cross-entropy on the mask is mainly used to remove the interference of background features; the 2D keypoints are regressed as Gaussian heatmaps; the 3D keypoints are normalized to the range 0-1 based on the initial pose of the object; and the projection error is used to guarantee the consistency between the 2D and 3D keypoints.
The loss functions required for training the pose estimation neural network with six degrees of freedom are the same as those in the above equations (12) to (18), and are not described in detail here.
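A minimal PyTorch sketch of that two-stage structure is given below: a shared backbone, a first-stage head that outputs class scores and a box, and second-stage heads for 2D keypoint heatmaps, a foreground mask and 3D keypoints normalized to 0-1. The backbone, layer sizes and keypoint count are placeholders, not the network actually claimed.

```python
import torch
import torch.nn as nn

class TwoStagePoseNet(nn.Module):
    """Sketch of the two-stage structure: stage one predicts the detection box,
    stage two regresses 2D keypoint heatmaps, a foreground mask and normalized 3D keypoints."""
    def __init__(self, n_classes: int = 2, n_keypoints: int = 8, feat_dim: int = 256):
        super().__init__()
        self.n_kp = n_keypoints
        self.backbone = nn.Sequential(                          # placeholder feature extractor
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.det_head = nn.Conv2d(feat_dim, n_classes + 4, 1)   # stage 1: class scores + box
        self.kp2d_head = nn.Conv2d(feat_dim, n_keypoints, 1)    # stage 2: Gaussian heatmaps
        self.mask_head = nn.Conv2d(feat_dim, 1, 1)              # stage 2: foreground mask
        self.kp3d_head = nn.Linear(feat_dim, n_keypoints * 3)   # stage 2: 3D keypoints in [0, 1]

    def forward(self, x):
        feats = self.backbone(x)
        det = self.det_head(feats)
        kp2d = self.kp2d_head(feats)
        mask = torch.sigmoid(self.mask_head(feats))
        pooled = feats.mean(dim=(2, 3))                         # global pooling for the 3D branch
        kp3d = torch.sigmoid(self.kp3d_head(pooled)).view(-1, self.n_kp, 3)
        return det, kp2d, kp3d, mask
```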
In the above embodiments, the object pose calculation and rendering unit 2 may be implemented by a mobile terminal, or may be implemented by mixing the mobile terminal and a cloud server.
As shown in fig. 10, the mode in which the object pose calculation and rendering unit 2 is implemented by the mobile terminal alone is suitable for the case where the user has few custom models. Before tracking starts, the cloud server is accessed only once; after the object model, the deep learning model, the feature database and the like of the user are downloaded, all other calculations are carried out on the mobile terminal. The mobile terminal reads camera data from the device, obtains the object pose through the detection or recognition neural network and the six-degree-of-freedom pose estimation neural network, and then renders the content to be rendered according to the pose.
As shown in fig. 11, the mode in which the object pose calculation and rendering unit 2 is implemented by the mobile terminal and the cloud server together is suitable for the case where the user has many custom models, and is a general solution for object pose tracking. During tracking, the cloud server is accessed and resources are downloaded one or more times. The mobile terminal inputs an image sequence and outputs the object pose and the rendered image.
The main flow of this mode is as follows: an image sequence is input at the mobile terminal and saliency detection is performed on each frame of image; the saliency detection region is uploaded to the cloud server for retrieval, yielding the object information and the deep learning model associated with it; these are loaded onto the mobile terminal for pose estimation, the object pose is obtained, and the content to be rendered is rendered according to the pose.
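The loop below is a high-level sketch of that hybrid mobile/cloud flow, included only to make the sequence of steps concrete. Every interface here (the camera iterable, cloud.retrieve, pose_models.get, renderer.render_overlay, and the saliency detector) is a hypothetical placeholder standing in for the corresponding component, not an actual API.

def hybrid_tracking_loop(camera, cloud, pose_models, renderer, detect_salient_region):
    # camera: iterable of image frames (numpy arrays)
    # detect_salient_region(frame) -> tuple of slices covering the salient area
    # cloud.retrieve(crop) -> object info with an id, a model reference and content to render
    # pose_models.get(obj_info) -> cached on-device 6-DoF estimator (downloaded once per object)
    for frame in camera:
        region = detect_salient_region(frame)                # per-frame saliency detection
        obj_info = cloud.retrieve(frame[region])             # cloud retrieval (one or more round trips)
        estimator = pose_models.get(obj_info)                # cached deep learning model
        pose = estimator.estimate(frame)                     # on-device 6-DoF pose estimation
        renderer.render_overlay(frame, pose, obj_info.content)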
It should be noted that: the object anchoring system provided in the above embodiment is only illustrated by the division of the above program modules; in practical applications, the above processing may be distributed to different program modules as needed, that is, the internal structure of the object anchoring system may be divided into different program modules to complete all or part of the processing described above. In addition, the object anchoring system embodiments and the object anchoring method embodiments provided above belong to the same concept, and their specific implementation processes are detailed in the method embodiments and are not described here again.
In an exemplary embodiment, the present application further provides a storage medium, which is a computer readable storage medium, for example, a memory including a computer program, which is executable by a processor to perform the steps of the foregoing object anchoring method.
The embodiments of the present application described above may be implemented in various hardware, software code, or a combination of both. For example, the embodiments of the present application may also be program code for executing the above-described method in a data signal processor. The present application may also relate to various functions performed by a computer processor, digital signal processor, microprocessor, or field programmable gate array. The processor described above may be configured in accordance with the present application to perform certain tasks by executing machine-readable software code or firmware code that defines certain methods disclosed herein. Software code or firmware code may be developed in different programming languages and in different formats or forms. Software code may also be compiled for different target platforms. However, different code styles, types, and languages of software code and other types of configuration code for performing tasks according to the present application do not depart from the spirit and scope of the present application.
The foregoing is merely an illustrative embodiment of the present application, and any equivalent changes and modifications made by those skilled in the art without departing from the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (8)

1. A method of anchoring an object, comprising the steps of:
training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object and a six-degree-of-freedom pose estimation neural network model for estimating the pose of the object; in the process of training to obtain the three-dimensional model of the interested object according to the acquired image sequence containing the interested object, the modeling is completed based on deep learning or computer vision, and the process of completing the modeling based on deep learning is as follows:
extracting the features of each frame of image, and estimating the camera initialization pose corresponding to each frame of image;
acquiring a mask of each frame of image by utilizing a pre-trained saliency segmentation network;
carrying out model training and inference to obtain a mesh of the model;
the process of completing modeling based on computer vision is as follows:
performing feature extraction and matching by adopting a visual algorithm or a deep learning algorithm;
estimating the pose of the camera;
segmenting salient objects in the image sequence;
reconstructing the dense point cloud;
using the reconstructed dense point cloud as the input of mesh generation, and reconstructing the mesh of the object by using a reconstruction algorithm;
finding the texture coordinates corresponding to the mesh vertices according to the camera pose and the image corresponding to the camera pose, to obtain a texture map of the mesh;
obtaining a three-dimensional model according to the mesh of the object and the texture map of the mesh;
the specific process of training the six-degree-of-freedom pose estimation neural network model for object pose estimation according to the acquired image sequence containing the object of interest comprises the following steps:
obtaining a synthetic data set by adopting a PBR rendering method according to the three-dimensional model and the preset scene model of the object; the synthetic dataset includes synthetic training data;
obtaining a real data set by adopting a model reprojection segmentation algorithm according to the camera pose and the object pose; the real dataset comprises real training data;
training a six-degree-of-freedom pose estimation neural network based on deep learning by utilizing the synthetic training data and the real training data to obtain a six-degree-of-freedom pose estimation neural network model;
and performing pose estimation on the interested object according to the three-dimensional model of the interested object and the six-degree-of-freedom pose estimation neural network model for object pose estimation to obtain the pose of the interested object, and superposing virtual information on the interested object according to the pose to realize the rendering of the interested object.
2. The method for anchoring an object according to claim 1, wherein said process of performing model training and inference is:
randomly acquiring K pixel points on the image, and converting the position coordinates of each pixel point into imaging plane coordinates by using the internal parameters;
inputting the imaging plane coordinates and the optimized camera pose into a neural network and extracting the color difference feature between frames; adding the color difference feature between frames to the original image to compensate the color difference between frames, the color difference feature between frames being given by [formula], in which the image truth value is used;
inputting the camera initialization pose corresponding to the image into a neural network to obtain the optimized pose, given by [formula];
obtaining the optimized camera initial position from the optimized pose, given by [formula], where T is a function that takes the position coordinates;
emitting, from the optimized camera initial position, a ray whose direction is w and which passes through the position coordinates of the pixel point, the direction w being given by [formula];
sampling M points along the direction w;
predicting, with a deep learning network, the probability that these M points lie on the surface of the implicit equation, the condition for a point to be predicted to lie on the surface of the implicit equation being given by [formula], in which a threshold is used and the minimum m satisfying the condition is taken;
feeding the points predicted to lie on the surface of the implicit equation into the neural renderer R to obtain the predicted RGB color values, given by [formula];
calculating the color of each pixel point according to the predicted values and the K acquired pixel points to obtain the square loss of the pixel differences L, given by [formula], in which the weights all represent coefficients and the terms are the difference of the image pixels, the sum of the difference of the background mask and the difference of the foreground mask, and the difference of the edges;
wherein the difference of the image pixels is given by [formula], in which P represents all the selected k points and the predicted color values are used;
the difference of the background mask is given by [formula], computed over the points, among all the selected k points, that lie outside the mask;
the difference of the foreground mask is given by [formula], in which BCE represents the binary cross-entropy loss, computed over the points, among all the selected k points, that lie within the mask;
the difference of the edges is given by [formula], computed over the boundary of the mask;
when the model is inferred, 3D points are input into the combined model of the neural networks and the deep learning network; the combined model is used to obtain the points lying on its surface, and the mesh is formed from these points.
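For readability, the structure of the training loss described in this claim can be summarized as the LaTeX sketch below; the weights and the exact form of each term are assumptions inferred from the names of the components, since the original formula images are not reproduced in the text.

\[
L \;=\; \lambda_1\, L_{\mathrm{rgb}} \;+\; \lambda_2\,\bigl(L_{\mathrm{bg}} + L_{\mathrm{fg}}\bigr) \;+\; \lambda_3\, L_{\mathrm{edge}},
\qquad
L_{\mathrm{rgb}} \;=\; \frac{1}{|P|}\sum_{k\in P}\bigl\lVert \hat{c}_k - c_k \bigr\rVert^2,
\qquad
L_{\mathrm{fg}} \;=\; \mathrm{BCE}\bigl(\hat{m}_k,\, m_k\bigr)\ \ \text{over}\ k\in P_{\mathrm{in}}
\]

where \(\hat{c}_k\) is the predicted color of pixel k, \(c_k\) its observed color, \(\hat{m}_k\) the predicted foreground probability, and \(P_{\mathrm{in}}\) the selected points inside the mask.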
3. The object anchoring method according to claim 1, wherein the specific process of obtaining the synthetic dataset by using the PBR rendering method according to the three-dimensional model of the object and the preset scene model is:
reading the three-dimensional model of the object and the preset scene model;
carrying out object pose randomization, rendering camera pose randomization, material randomization and illumination randomization by adopting the PBR rendering method to obtain a series of image sequences and their corresponding annotation labels; the annotation labels are the category, the position and the six-degree-of-freedom pose.
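To make these randomization steps concrete, the sketch below loops over random object poses, camera poses, materials and lighting and records each image together with its category, position and six-degree-of-freedom pose label. The callables build_scene, render_pbr and sample_pose, and the scene methods, are hypothetical placeholders for whatever PBR backend is used; only the randomization logic is illustrated.

def generate_pbr_samples(object_model, scene_model, build_scene, render_pbr, sample_pose, n_images):
    samples = []
    for _ in range(n_images):
        obj_pose = sample_pose()                       # object pose randomization
        cam_pose = sample_pose()                       # rendering camera pose randomization
        scene = build_scene(scene_model, object_model, obj_pose, cam_pose)
        scene.randomize_materials()                    # material randomization
        scene.randomize_lighting()                     # illumination randomization
        image = render_pbr(scene)                      # physically based rendering
        label = {"category": object_model.category,    # category label
                 "position": scene.project_bbox(),     # 2D position / bounding box
                 "pose_6dof": obj_pose}                # six-degree-of-freedom pose label
        samples.append((image, label))
    return samples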
4. The object anchoring method according to claim 1, wherein the specific process of obtaining the real dataset by using the model reprojection segmentation algorithm according to the camera pose and the object pose is as follows:
acquiring an image sequence, a camera pose and an object pose, and segmenting an object in a real image;
synthesizing the real data with discrete poses into data with denser and more continuous poses, so as to obtain real images and their corresponding annotation labels; the annotation labels are the category, the position and the six-degree-of-freedom pose.
5. The object anchoring method according to claim 1, wherein the training of the six-degree-of-freedom pose estimation neural network based on deep learning by using the synthetic training data and the real training data to obtain the six-degree-of-freedom pose estimation neural network model comprises:
inputting the 2D coordinates of a plurality of feature points extracted from the image and the object, the 3D coordinates corresponding to the feature points, and the image mask;
training the six-degree-of-freedom pose estimation neural network by adopting the following loss function to obtain a six-degree-of-freedom pose estimation neural network model;
the loss function required when training the six-degree-of-freedom pose estimation neural network is given by [formula], in which the total loss is a weighted sum whose weights all represent coefficients and whose terms are the classification loss, the bounding box loss, the 2D loss, the 3D loss, the mask loss and the projection loss;
wherein the classification loss is given by [formula], in which the classification information of the i-th detection anchor point and the information of the j-th background feature are taken, over the detection anchor points and the background anchor points, using the category truth value and the features proposed by the neural network;
the bounding box loss is given by [formula], in which the coordinate features of the i-th detection anchor point are compared with the truth value of the detection box coordinates;
the 2D loss is given by [formula], in which the 2D coordinate features are compared with the truth values of the 2D feature points of the object;
the 3D loss is given by [formula], in which the 3D coordinate features are compared with the truth values of the 3D feature points of the object;
the mask loss is given by [formula], in which the i-th foreground feature and the j-th background feature are taken, fg representing the foreground and bg representing the background;
the projection loss is given by [formula], in which the 3D features are projected to 2D and then differenced with the 2D truth values, using the feature points and the mask predicted by the neural network.
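As a reading aid, the overall loss of this claim can be written as the weighted sum sketched below; the weights and the \(\ell_1\) distance used in the projection term are assumptions, since the claim only names the individual losses.

\[
\mathcal{L} \;=\; \lambda_{\mathrm{cls}} L_{\mathrm{cls}} + \lambda_{\mathrm{box}} L_{\mathrm{box}} + \lambda_{2\mathrm{D}} L_{2\mathrm{D}} + \lambda_{3\mathrm{D}} L_{3\mathrm{D}} + \lambda_{\mathrm{mask}} L_{\mathrm{mask}} + \lambda_{\mathrm{proj}} L_{\mathrm{proj}},
\qquad
L_{\mathrm{proj}} \;=\; \bigl\lVert \pi\bigl(K\,[R \mid t]\,X_{3\mathrm{D}}\bigr) - x_{2\mathrm{D}} \bigr\rVert_1
\]

where \(\pi\) denotes perspective division, \([R \mid t]\) the predicted pose, \(X_{3\mathrm{D}}\) the predicted 3D feature points, and \(x_{2\mathrm{D}}\) the 2D feature-point ground truth.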
6. The object anchoring method according to claim 1, wherein the rendering of the object of interest is implemented by a mobile terminal or by a mobile terminal mixed with a cloud server;
the process realized by the mobile terminal is as follows:
before tracking is started, accessing a cloud server, downloading an object model, a deep learning model and a feature database of a user, and then performing other calculations on a mobile terminal;
the mobile terminal reads camera data from the device, and the object pose is obtained through the detection or recognition neural network and the six-degree-of-freedom pose estimation neural network;
rendering the content to be rendered according to the pose of the object;
the process of realizing the mixing of the mobile terminal and the cloud server is as follows:
inputting an image sequence at the mobile terminal, and performing saliency detection on each frame of image;
uploading the saliency detection region to a cloud server for retrieval to obtain the information of the object and the deep learning model related to it, and loading them onto the mobile terminal;
estimating the pose of the object at the mobile terminal to obtain the object pose;
and rendering the content to be rendered according to the pose of the object.
7. An object anchoring system is characterized by comprising a cloud training unit and an object pose calculation and rendering unit;
the cloud training unit is used for training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object and a six-degree-of-freedom pose estimation neural network model for estimating the pose of the object;
the object pose calculation and rendering unit is used for estimating the pose of the interested object according to the three-dimensional model of the interested object and the six-degree-of-freedom pose estimation neural network model for estimating the pose of the interested object, and superposing virtual information on the interested object to realize the rendering of the interested object;
the cloud training unit comprises a modeling unit, a synthetic training data generating unit, a real training data generating unit and a training algorithm unit;
the modeling unit is used for training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object;
the synthetic training data generation unit is used for obtaining a synthetic data set according to a three-dimensional model of an object and a preset scene model, and the synthetic data set comprises synthetic training data;
the real training data generation unit is used for obtaining a real data set according to the camera pose and the object pose, and the real data set comprises real training data;
and the training algorithm unit is used for training the six-degree-of-freedom pose estimation neural network based on deep learning according to the synthetic training data and the real training data to obtain a six-degree-of-freedom pose estimation neural network model.
8. A storage medium having stored thereon an executable program which, when invoked, performs the steps in the object anchoring method according to any one of claims 1 to 6.
CN202210173770.0A 2022-02-25 2022-02-25 Object anchoring method, anchoring system and storage medium Active CN114241013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210173770.0A CN114241013B (en) 2022-02-25 2022-02-25 Object anchoring method, anchoring system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210173770.0A CN114241013B (en) 2022-02-25 2022-02-25 Object anchoring method, anchoring system and storage medium

Publications (2)

Publication Number Publication Date
CN114241013A CN114241013A (en) 2022-03-25
CN114241013B (en) 2022-05-10

Family

ID=80748105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210173770.0A Active CN114241013B (en) 2022-02-25 2022-02-25 Object anchoring method, anchoring system and storage medium

Country Status (1)

Country Link
CN (1) CN114241013B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9996936B2 (en) * 2016-05-20 2018-06-12 Qualcomm Incorporated Predictor-corrector based pose detection
EP3705049A1 (en) * 2019-03-06 2020-09-09 Piur Imaging GmbH Apparatus and method for determining motion of an ultrasound probe including a forward-backward directedness
CN112884820B (en) * 2019-11-29 2024-06-25 杭州三坛医疗科技有限公司 Image initial registration and neural network training method, device and equipment

Also Published As

Publication number Publication date
CN114241013A (en) 2022-03-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant